R sucks

I’m doing an analysis and one of the objects I’m working on is a multidimensional array called “attitude.” I took a quick look:

> dim(attitude)
[1] 30  7

Huh? It’s not supposed to be 30 x 7. Whassup? I search through my scripts for an “attitude” but all I find is the three-dimensional array. Where did this 2-way array come from? I take a look:

> attitude
   rating complaints privileges learning raises critical advance
1      43         51         30       39     61       92      45
2      63         64         51       54     63       73      47
3      71         70         68       69     76       86      48
4      61         63         45       47     54       84      35
5      81         78         56       66     71       83      47
6      43         55         49       44     54       49      34
7      58         67         42       56     66       68      35
8      71         75         50       55     70       66      41
9      72         82         72       67     71       83      31
10     67         61         45       47     62       80      41
11     64         53         53       58     58       67      34
12     67         60         47       39     59       74      41
13     69         62         57       42     55       63      25
14     68         83         83       45     59       77      35
15     77         77         54       72     79       77      46
16     81         90         50       72     60       54      36
17     74         85         64       69     79       79      63
18     65         60         65       75     55       80      60
19     65         70         46       57     75       85      46
20     50         58         68       54     64       78      52
21     50         40         33       34     43       64      33
22     64         61         52       62     66       80      41
23     53         66         52       50     63       80      37
24     40         37         42       58     50       57      49
25     63         54         42       48     66       75      33
26     66         77         66       63     88       76      72
27     78         75         58       74     80       78      49
28     48         57         44       45     51       83      38
29     85         85         71       71     77       74      55
30     82         82         39       59     64       78      39

Ummmm, wha? Is it an example I used in class? I don’t recall any such dataset; then I remember I just recently restarted R, so it can’t be anything from my class anyway. I google *R attitude* and find that it’s one of the preprogrammed examples in R, one of those long-dead datasets that they like to include, really part of the Bell Labs and Tukey tradition of demonstrating methods on super-boring old data (remember the airport temperatures in Yuma, Arizona, from the EDA book?).

OK, fine, this is annoying, I’ll delete it:

> rm(attitude)
Warning message:
In rm(attitude) : object 'attitude' not found

That’s just perverse. OK, I’ll overwrite it:

> attitude=0
> attitude
[1] 0

That works. Now I’ll remove it for real:

> rm(attitude)

Now let’s check that it’s really gone:

> attitude
   rating complaints privileges learning raises critical advance
1      43         51         30       39     61       92      45
2      63         64         51       54     63       73      47
3      71         70         68       69     76       86      48
4      61         63         45       47     54       84      35
5      81         78         56       66     71       83      47
6      43         55         49       44     54       49      34
7      58         67         42       56     66       68      35
8      71         75         50       55     70       66      41
9      72         82         72       67     71       83      31
10     67         61         45       47     62       80      41
11     64         53         53       58     58       67      34
12     67         60         47       39     59       74      41
13     69         62         57       42     55       63      25
14     68         83         83       45     59       77      35
15     77         77         54       72     79       77      46
16     81         90         50       72     60       54      36
17     74         85         64       69     79       79      63
18     65         60         65       75     55       80      60
19     65         70         46       57     75       85      46
20     50         58         68       54     64       78      52
21     50         40         33       34     43       64      33
22     64         61         52       62     66       80      41
23     53         66         52       50     63       80      37
24     40         37         42       58     50       57      49
25     63         54         42       48     66       75      33
26     66         77         66       63     88       76      72
27     78         75         58       74     80       78      49
28     48         57         44       45     51       83      38
29     85         85         71       71     77       74      55
30     82         82         39       59     64       78      39

Damn. This is really stupid. Sure, I can understand that R has some pre-loaded datasets, fine. But to give them these indelible names, that’s just silly. Why not just make the dataset available using a “library” call? It’s a crime to pollute the namespace like this. Especially for those of us who work in public opinion research and might want to have a variable called “attitude.”

Yes, when I define my own “attitude” variable, the preloaded version in R is hidden, but the point is that to have such a variable sitting there is just asking for trouble.

P.S. R is great. It is because R is so great that I am bothered by its flaws and I want it to be even better.

146 thoughts on “R sucks”

  1. What really sucks is the dumb user who doesn’t understand how R works and who, instead of recognizing that, prefers to blame R. The documentation of that data set in R describes a data.frame with 30 rows and 7 variables; check for example:
    https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/attitude.html
    And it is well known that if the user defines an object with the same name as an already existing (or preloaded) object in R, then R gives preference to the user-defined object, but it is not “overwritten” as the dumb user believes. So there is nothing wrong with R in this example.

    • Arturo:

      Not all that is well-known to you is well-known to simple users such as myself. But indeed I did know that I have to be careful if I want to overload a pre-existing word in R such as “matrix” or “lm” or “read” or whatever. What I didn’t know is that the word “attitude” would be taken! And why—because it’s used in a pre-loaded dataset on some application I’ll never care about? That’s not good. That’s just R laying a land mine that will blow up on unsuspecting users, as it did for me.

      To have a useless object that I can’t just delete, that’s annoying. But, hey, you probably like Clippy, too!

        • Sam:

          I’ve been pretty careful to always use TRUE and FALSE, not T and F, and, just to be safe, to avoid giving variables the names T and F. I do sometimes use “for (t in 1:T)”, though.

        • I tried using a script you posted a bit back (the 1000 simulated temperature series), and it messed up my R session when all sorts of variables suddenly equaled 135 instead of TRUE. It took a while to discover that the problem was the line T=135 right at the top of the script I ran in passing.
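
          A minimal sketch of that gotcha, with a made-up value (T and F are ordinary variables bound to TRUE and FALSE, not reserved words):

          T <- 135     # a script quietly masks T; TRUE itself cannot be reassigned
          isTRUE(T)    # FALSE: anything that relied on T meaning TRUE now misbehaves
          rm(T)        # removes the mask; T resolves to base's TRUE again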

        • How about this one: “(” <- sum. Yes, parentheses are functions that can be assigned. Actually, you can reassign the assignment operators too.
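
          A runnable variant of that trick, as a sketch only (don’t do this in a real session):

          "(" <- function(x) x + 1   # "(" is itself a function, so it can be masked
          (2 * 3)                    # now returns 7 rather than 6
          rm("(")                    # remove the mask to restore normal behavior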

      • Dear Andrew: I’m nobody, so whatever I post really doesn’t matter, but you are a celebrity in the statistics world; thousands of people follow and consider what you say, and if Andrew Gelman says “R sucks” you may hurt one of the most relevant and successful statistical computing projects. Cheers!

        • Nice, but the wrong message has already been delivered. For example, Twitter user @afrakt, who is an economist with more than 10 thousand followers, posted “I don’t use R, and not for this reason, but I know people who do. Here’s a warning” and cited your Twitter post “R sucks”. This guy is not an R user, but clearly believes what you say, and for him if Andrew Gelman says “R sucks” then he retweets that to his 10K followers. This is the kind of reaction I was afraid of… too late :-(

        • Let’s be real, as a longtime R user, the R language does suck and should be called out for it.

          Namespace pollution, introduced so that old-timey statisticians wouldn’t have to type one line to load an example dataset, is only one of many problems.

          The original design choices assume there’s a human in the loop to figure out the multitude of needless type errors due to weak language design and base functions. The consequence is R code doesn’t scale well and is inherently non-robust.

        • This is a characteristic of a lot of long-lived projects: When they started out they were indeed targeting simple users & smaller projects and not many envisioned the sophisticated use cases of today.

          Changing anything drastically in the design is a big pain because a lot of legacy code could break. If one wants to dump R entirely as a user in favor of a nicer language, the alternatives aren’t great.

          I guess there are no easy solutions here.

        • Andrew:

          Sometimes you go overboard with those clickbait titles, methinks.

          What you really mean isn’t what’s delivered by your blog post titles.

      • You don’t have to be careful about `matrix` or `lm` etc. if using them as functions, because R is clever enough to know that the matrix `matrix` you created probably isn’t what you are referring to when you call `matrix(foo)` (unless you overwrite those functions with your own functions). Clashing of function names is a growing problem and something that users may have to start contending with, especially where package authors intentionally mask (they don’t overwrite the original) a function shipped with R.
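
        A small sketch of that lookup rule:

        matrix <- data.frame(a = 1:3)   # a data object masks the name "matrix"
        m <- matrix(1:6, nrow = 2)      # still calls base::matrix(): in call position,
                                        # R skips objects that are not functions
        rm(matrix)                      # drop the mask; base::matrix was never touched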

        The data issue you mention is a newer feature of R, though it has been there for some years. Before then you needed an explicit `data(attitude)` call to load a named data set from the datasets package. Now, with lazy loading of data objects from packages, data sets in the datasets package can be referred to without the `data()` call. But your problem wouldn’t have happened if you hadn’t tried to access your `attitude` before it was created; it would have gone away if you’d just created your `attitude` before trying to access it.

        Can you delete `matrix` or `lm`? No. So I’m not sure why you expected `attitude` or anything else not defined in your global environment to work any differently.

        • Exactly. The problem seems to be that Andrew didn’t actually create the array he thought he did. Is that R’s fault? I call my data data all the time and appreciate the fact that R can distinguish between the operator data() and the object I called data.

        • I may be wrong, but people expect to see the default namespace cluttered with functions, not data objects? E.g., nobody is generally surprised when, say, sum() or max() is pre-loaded.

          But seeing a “cities” object or a “temperatures” data structure there without any explicit action on my part is treated as more unexpected.

          I recall that in some languages pre-created variables at least follow a special naming convention so that the user cannot inadvertently tread on them, e.g. __attitude__ etc.

        • @Rahul the default namespace isn’t cluttered with these data objects. Those objects are all nicely contained in the namespace of the datasets package. That anyone sees these things is just due to that namespace being searchable, as it is on the search path. You can easily change this if you don’t like it.

          Also, a user cannot inadvertently tread on these objects; they are sealed in the namespace. A user can mask things with their own objects but that has always been the case; nothing new here.

          This entire post & thread is quite pointless; it’s only Andrew’s popularity that lends it any weight, and what he finds sucks is a trivial issue about being able to find an object with the same name as something he thought he’d created but actually hadn’t.

          Can we all just move along now please?

        • Gavin:

          You can call this trivial if you want, but for better or for worse I’m not the only one who writes R scripts and then cuts and pastes code out of order, sometimes inadvertently trying to access variables that have not been created.

          It’s just weird for a general-purpose statistics package to have a pre-loaded variable called “attitude” that corresponds to some preloaded dataset that nobody would ever care about. Yes, we can all work around it. But it’s weird.

          Or, to put it another way, I’m not an R developer, I’m an R user. I’m sharing the user’s perspective, and a lot of people find that perspective helpful.

        • @Andrew the problem is that you’re characterizing this issue as something that needs to be “worked around” when the reality is that if you had created the object you intended, everything would have worked just fine. The only work-around needed is if you wanted to access the attitude data set after you had created your own object. The bottom line is the error was yours, not R’s – it’s just that through your error you discovered that R has a named object of which you were previously unaware!

        • Anon:

          That’s the point: I made a mistake, and I make mistakes all the time. I would like my programming language to be robust to mistakes. This is not always possible, but in this case the failure was unnecessary because it involved an object of which I was completely unaware and which I would never have occasion to use, just sitting there waiting to cause problems!

        • @Andrew (again) :-) I actually agree with you. It would be great if programming languages could minimize our mistakes. In this case it is a contest between avoiding two mistakes. One mistake is a user accidentally calling a named object when he intended to call an object that he forgot to load. Another mistake is a user being unable to run an example because a data file hasn’t been loaded. In the latter case the error message would be clear and unambiguous (object so and so not found). In your case you get an error message distantly related to the problem (at best) or ambiguous results (at worst). The first mistake (yours) probably happens less frequently than the second but the potential for serious error seems greater.

          So I do agree with you that it’s not a good idea to load example data sets by default. But this is a trivial issue relative to the one of namespaces that you seemed to be complaining about (and easy to address as described in a number of posts).

          The reason I think your post received a lot of negative response is that rather than asking to understand the issue, you immediately attacked the way namespaces are handled with an extremely inflammatory title. I hope you understand now the issue is not one of namespaces per se but that of which objects are loaded by default. And BTW, I bet if you asked 100 people which functions should be loaded by default (in base) you’d get more than a few answers … fundamentally that is the issue.

        • I think the basic principle is that programming languages should minimize reserved words, and at this point it would be better, if the datasets have to be there, to either always call them via datasets:: or rename them all with a _ at the beginning.

        • Many programmers and analysts enjoy seeing reasonable error messages as a result of mistakes. In this case, instead of an error message, you (silently!) get a dataset unrelated to your work. I think it’s fair to say that is poor error behavior. As pointed out below, R could be more judicious in its use of pre-defined pseudo-reserved names.

        • No, I think test datasets should only be loaded by explicit calls from the user. Cluttering up the namespace like this is in itself a bad idea.

        • I agree with Rahul above. R does have an ‘attach’ command to load such data sets. IMHO the problem is that these datasets are hidden variables in the namespace. For reproducibility, I start most of my scripts with a clean workspace (rm(list=ls())). As Andrew noted, hidden variables like these are a recipe for a frustrating debugging session. My answer to “Anonymous” above is that I expect the standard methods to be in the namespace. I don’t want to go the route of Java, C, or Python where one has to import virtually everything. But I do expect to have to specifically load even an example dataset into the global environment. Especially when memory becomes an issue…
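
          A quick sketch of that hidden-variable point in a fresh session:

          rm(list = ls())                        # clears only the global environment
          exists("attitude", inherits = FALSE)   # FALSE: nothing of ours by that name
          exists("attitude")                     # TRUE: still found via package:datasets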

          I would also take issue with Arturo’s ad-hominem attack earlier. Andrew has shown himself to be a careful experimenter and his issue should not be dismissed out of hand. We are all free to agree or disagree, but a charitable attitude is more helpful to get a listening ear.

      • attitude <- data.frame(matrix(1:100, 10))  # mask the built-in dataset

        lm <- 31                                   # mask the function name lm with a number
        with(attitude, lm(X1 ~ X2 + X3))           # still finds and calls stats::lm()
        lm                                         # prints 31

        Cool thing… You can overwrite attitude with a new data set, and likewise lm as you point out, but R is pretty smart here: knowing it can’t apply 31 to the formula, it reverts back to the real lm().

        • You inspired me to write the following runnable R script (on a default installation):

          with(attitude, lm(advance ~ learning))

          With attitude, advance learning FTW! Greatest.Language.Ever.

    • As an advocate of R, I apologize for Arturo’s response. It’s certainly not the attitude we want to cultivate in the community.

      I agree that this behavior is confusing to new, or even intermediate R users. Hopefully a future release can improve the behavior.

    • No, there definitely is something wrong with R in this case. RTFM is a terrible attitude to hold against even newbies, but against someone like Prof. Gelman it doesn’t even make sense.

      Design, whether of teapots or computer languages, is important. And one of the driving concerns should be to put information in the tool rather than in the user. That is something that Apple’s design is rightfully lauded for. Give a 3-year-old an iPad and they can pretty much figure it out. Why? Good design that puts most of the information needed to run the iPad in the display rather than relying on it being in the user’s head.

      If a computer language were to define the ‘+’ symbol as the multiplicative operator and the ‘*’ symbol as the additive operator, I would say that language sucks. Even if that feature were very well documented, it is not natural to a user, and it is as poor design as a backward teapot or a default object that can’t be removed using the standard “rm” command.

    • +1. I admit that you were kind of rude, but this post deserved some response like this. Andrew being famous doesn’t mean his mistakes should be treated as problems with the software.

    • As an R developer, it is quite useful to have the datasets in the default installation. That means that I can write sample code (aka test code) for my functions using the supplied data and just have it work.

      It is also useful for bare beginners, as I can teach them about other R functions using the built-in data without teaching them how to read data from a file first.

      This behavior doesn’t confuse me because I have a rather sophisticated understanding of R’s search and package mechanism, but for intermediate R users, it can be confusing.

      More confusing than this one is the fact that functions will use global variables in the workspace without warning. Sometimes when I’m building a function, I’ll run the function step by step in the R workspace. That will create copies of a lot of the local variables in my workspace. Then I’ll fix a spelling error in one of those variables in one place, but not another. The code will continue to work on my test case, because I have the misspelled local variable in my workspace, but it will mysteriously fail on a different example, or when I give it to a student.
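
      A minimal sketch of that failure mode, with hypothetical names:

      threshold <- 10                       # stale copy left over from interactive testing
      keep_small <- function(x, thresh) {
        x[x < threshold]                    # typo: meant thresh; silently finds the global
      }
      keep_small(c(3, 8, 15), thresh = 5)   # returns 3 and 8 here; errors in a clean session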

      It is part of my love/hate relationship with S (R fixed some but not all of the problems in the S language). There are lots of features that make it easy to improvise new analyses without worrying about the safety of your code. But it also makes it harder to write foolproof code. (Of course, no matter how foolproof the code, the Universe can always find a bigger fool.)

    • @Toru That sounds drastic, especially when, as @Thomas shows below, it’s pretty easy to alter which packages R loads by default through a simple local preference or setting (see `?Startup` for another option that doesn’t require setting an environment variable).

      Interestingly, R Core have decided to run R with nothing but the base system (no utils, tools, etc. packages loaded) when we run checks on our packages via `R CMD check`, specifically to ensure that package authors are being explicit about which functions, and from which packages, should be imported into a package’s namespace.

    • It’s useful to have the core datasets, but I think they should be given standardized names (e.g. prefixed with sample_data_ or something like that). It’s really helpful for testing to have the common data. Though I know way more about irises than I want.

        • The help system contains blocks of sample code that you can run to understand whatever it is that the help page is about. Many of those examples use the sample data. For example, I looked just now and found two that use the islands dataset. Now, would it be better for many of those help pages to create their own simple data for examples, either with explicit numbers or generated from a distribution? Possibly, but it would make help marginally harder to use (and has the same effect of creating objects with names).

          Are all those data sets used somewhere? I really doubt it. They seem designed for teaching in some ways, from a time (say 1980) when you would want to use “real data” (which they are). For example, the attitude data set Andrew is complaining about is from Chatterjee, S. and Price, B. (1977), Regression Analysis by Example.

          Of course, removing them would break thousands of teaching examples, textbooks, and code in years of blog posts, and this is the backward-compatibility issue that all languages and software face. Would that be good for R adoption? No, obviously it would be a train wreck. Thousands of people would be complaining instead of just Andrew.

          In my opinion this is an example of something that was set up early on, made sense to the people involved and helped with adoption. It doesn’t use modern software practices which would have segregated it or required a prefix or something else. Having dealt with similar issues in other language contexts, I can say that’s what happens all the time. Developers do things that make sense at the time, but years later they realize it would have been better to think about it differently.

  2. I knew there had to be an unambiguously superior characteristic of Stata. In Stata, all the example data sets have to be loaded explicitly. Definitely worth the several-thousand-dollar difference in cost.

  3. R’s handling of namespaces is one of its warts — everything is loaded into the global namespace (yes, I know about ::). I’ve had programs fail in non-intuitive ways because it was unclear which object was being referred to. The “select” function from dplyr often gets overwritten by a “select” in another package.

    R is great for interactivity but very difficult for writing reliable scripts unless one is very careful.

    • That’s not how this works, but you might have that impression. What is happening is that `dplyr::select` is being masked by another `foo::select` that occurs earlier on your search path. Everything certainly isn’t in one global namespace like you describe. This is going to be a recurring problem, especially when package authors use common dictionary words like this, and users are going to have to get a bit more familiar with using `::` to specify which function they really want. But this is no different from many other programming languages. Also, when package authors, including Hadley, think it’s OK to reuse function names from the base R suite of functions but in ways incompatible with the base implementation, users are just going to have to suck up this use of `::`.
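
      A sketch of that masking scenario, assuming both dplyr and MASS are installed (MASS also exports a select()):

      library(dplyr)
      library(MASS)                 # attached later, so MASS::select() now masks dplyr's
      # select(mtcars, mpg)         # would hit MASS::select() and fail on a data frame
      dplyr::select(mtcars, mpg)    # unambiguous: always dplyr's version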

  4. I triple dare you to post this to r-help!!

    People have been flamed to death for much more subtle suggestions (I remember when Hadley Wickham started posting a lot on the stuff he was working on – he got flamed – there is even a video out there of one of his talks where he mentions this).

    BTW – I believe you can put the detach command in one of those .R files (or whatever it is called – I am too lazy to look it up) that is read at startup so that at least for your work you won’t have to keep detaching it.

    • R-help already showed up above! There were many helpful comments about detaching the datasets package (maybe via a site file), but only one r-helpful comment!

    • — I remember when Hadley Wickham started posting a lot on the stuff he was working on – he got flamed – there is even a video out there of one of his talks where he mentions this

      Given how much Hadley has contributed to R, that nonsense just demonstrates the downside of inbreeding.

      my shirt is flame proof, so go ahead.

  5. My attitude towards R is similar to my attitude towards the English language: It sucks, but I know how to use it, and everyone else is using it, so I’ll keep using it.

        • +1

          All said & done R is great. Doesn’t mean one cannot complain about some warts.

          What was wrong was to sum it up as “R Sucks”. Clickbait titles suck.

        • R does not put everything in a global namespace. Each package and your current environment has its own namespace. However, if you try to use a variable that is not available in your current namespace, R helpfully tries to locate it in other namespaces. In this case it found the variable in the namespace of the package, so R used it.

          Now imagine the alternative if R did not do that. First of all, R’s useful functions such as lm, mean, sd, etc. would not be available, and you would need to prepend stats:: or base:: to each of them. Secondly, all variables would be local, i.e. inside a function you would not be able to use variables defined outside it. Every language needs to implement some sort of variable scoping rules, which usually are a compromise between ease of use and quirkiness. I personally like the R implementation: first of all, it is documented and consistent (it is definitely not designed by newbies), it is extremely powerful and lets you do various useful things, and finally I could not think of a better system myself.

          The simple way of solving Andrew’s problem is to treat every variable as a reserved word and only use it if you have declared it. Then you will never have the problem of variables appearing out of nowhere. Alternatively, do not use English for naming your variables :) This strategy has always worked for me, in any programming language.
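
          A tiny sketch of the scoping behavior being described (a function can see names defined in its enclosing environment):

          a <- 2
          f <- function(x) a * x   # a is not local; it is found in the enclosing (global) environment
          f(3)                     # 6
          environment(f)           # <environment: R_GlobalEnv>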

      • R sucks except when it works. Fortunately it works quite well most of the time. So Andrew’s particular problem is not really his point; the point is that R users spend their days with quirks like this. I am having adventures with lists now. Friday I spent two hours trying to get mapply to work. It did eventually, when I tweaked it. But it did not do what the help file said it should do.

  6. There must be room on the internet for an R fork that doesn’t load the datasets by default; maybe it could also set stringsAsFactors=FALSE and check.names=FALSE as immutable defaults. The only problem would be the inevitable nuclear explosion if you ever asked a question on r-help about it…

  7. Well, if you can’t be bothered to learn all the semi-“reserved” words in a language, you’re bound to get bitten eventually. Now, how many does R have?

    > s = sort(do.call(c,sapply(1:length(search()), function(p){ls(pos=p)})))
    > length(s)
    [1] 2338

    Yes, with just the basic packages loaded, there are 2338 things in my search path.

  8. You should report the solution(s) here. There are several. The easiest is just:

    > unloadNamespace("datasets")
    > attitude
    Error: object 'attitude' not found

    Another, which is actually probably better is to always start R without the default packages loaded. To do that, set an environment variable:

    R_DEFAULT_PACKAGES=NULL

    Then start R. That will force you to explicitly state package dependencies, including those from the default packages (other than base, which is always available).

  9. R isn’t in the ALGOL/C lineage, but 99.44% of coders are; thus the never-ending conflict. Whether it’s worth bending one’s brain such that R’s semantics and syntax are the “natural” approach is by no means a given. (I’ll shut up about environments.) The Python folks, not in the lineage but closer, are gaining. Remember Satchel Paige.

  10. This thread reminds me of Andrew at the Use R! conference in Dortmund (2008?), where he was a keynote speaker. In that lecture too, Andrew proceeded to tell the audience how badly R sucked. I was pretty surprised, because it just didn’t make sense that one would go to a conference as a keynote speaker and use that opportunity to insult the people who invited him. R certainly isn’t perfect, but which programming language is?

    Maybe Andrew can also write a blog post one day on how great R is; surely that thought must have crossed his mind at some point.

    • Come on folks, get a life. Anyone who reads this blog regularly knows that Andrew posts R code for his examples and analyses, and R is clearly the tool he uses. But who hasn’t, using any particular tool, wasted a lot of time because of some relatively obscure and undocumented feature and sat there and thought “this sucks”. The difference being that Andrew has a blog to say it in.

      Even more importantly, while Andrew may not be an R expert, he is probably a more sophisticated user than most, and if this bites him in the butt, related things have probably bit others. What is amazing about R is the incredible amount of effort put in by people for free. But what I find frustrating is the often incredibly defensive (to be kind) response when people find something very difficult to do, and say so (look at some of the responses in here). If a tool only works well for experts, you have a problem. And if there are design decisions that make it reasonable to do what R is doing, why not explain that rather than flame people, which also helps to educate users.

      • Very well said Roy.

        I don’t think it behooves the R community to be nasty to each other because someone complained about a feature/bug/whatever within R. Andrew has a quirky sense of humour, and I think it’s pretty clear that he doesn’t really think R sucks (e.g., he regularly posts R code for his analyses, promotes R with Stan, has written an R package, etc.). Even if you think Andrew is completely wrong, there’s more constructive ways to convey this without being rude.

        • Funny thing about the snobbish pose is that if you ask the real snobs, who know about programming language design, they’ll tell you that R as a language is a joke.

          This is analogous to slagging a user as a dumb PHP user because they haven’t internalized lolphp’isms .

  11. I get around problems like this (or at least speed up the troubleshooting) by using an IDE (Eclipse + StatET). Even when I’m just messing around I create a script which loads the libraries I’m using (and has the unload command ready to run). There’s also a pane with a list of all my objects.

    It doesn’t solve the problem, but it makes the accounting easier.

  12. The fact of the matter is R is a poorly designed language for most modern use cases.

    With R, inconsistencies between versions seem to be evidence of a total lack of regression testing. It seems no overarching architecture was ever defined and newer versions of the language are still a mishmash of historical precedents from previous languages and dated standards with very little critical evaluation of how statistical computing can be modernized and improved for the analysis of modern data.

    To contrast, Python, while having some of its own issues, has a consistent and predictable architecture. It bridges the gap between functional programming and OOP in a clear, concise, no nonsense syntax. If you’ve worked with any OOP or functional language, you can learn Python in a matter of days. There are few exceptions, curve balls, and compatibility issues. Python may never be the best language for statistics, but it is an exceptional example of a language that has achieved its purpose manifold.

    Now, R is significantly senior to Python, no doubt, but the reality is R is at an age where its flaws and issues are becoming major obstacles to achieving what the goals of such a language should be: enabling rapid, facile, powerful, reproducible analysis. It hasn’t scaled to the modern world, and it’s silly and naïve to think that R is an end-all-be-all language for statistics and science; it’s far from it.

    Keep your eyes on Julia, it may just be the antidote so many R users are searching for.

      • That may be the case; it’s speculative at the moment. With a nascent language, adoption is like investment in a startup – it may not be the wisest way to spend your time and money (effort) but, if you’re lucky, it could yield massive benefits. On principle alone, I would argue learning another language is never a waste of time, but that’s another topic entirely.

        All said, language design and development is an evolving science. We’ve come a long way and learnt many, many lessons since the days when R was first released. Julia, like Python, is in many ways the gestalt of this process; the syntax, power, and speed (comparable to C and augmented by a parallel-first architecture) are only possible because of advances in language design and development, and together they are a great reason for R users to keep an eye on it.

        • The bar for a new language to be adopted is very high. You cannot just be incrementally better than the status quo language. You must be astronomically better for people to start shifting over.

          The overheads for learning a new language are high. The utility of having the libraries and community that surrounds an old, albeit flawed, language is huge.

      • I think Julia’s development testing was already much better than that post reported. As a statistician who converted to Julia from R, I’ve never run into anything like that. I think both languages are needed, but Julia will meet the needs of more modern statistical problems.

  13. You have eloquently made clear that you don’t understand the search path. The ‘attitude’ dataset is not in your “namespace” (i.e. the global environment), so it is not “polluting” anything. Moreover, you can make it disappear and re-appear at will. Consider the following:

    > find("attitude")
    [1] "package:datasets"
    > detach("package:datasets")
    > find("attitude")
    character(0)
    > library(datasets)
    > find("attitude")
    [1] "package:datasets"
    >

    Also the ‘datasets’ package does not have to be among the default packages. You can easily find out how to arrange that, too. The moral is you need to work *with* R, and not expect that R necessarily works the way you expect it should. (Here endeth the sermon.)

    • Bill:

      Obviously I don’t understand. That’s the point! This “attitude” variable was a little land mine sitting in R, just waiting to blow up on me. And it did. I’d prefer for my programming language to have fewer land mines.

      • Hi, let me also add that, as an instructor who teaches R to students every semester, I am always careful not to use pre-loaded datasets, because I think a crucial skill in using any software package is being able to work with outside data.

        • Actually, in terms of teaching… you can’t use the help system without having whatever dataset the help is using for a given function. (Well, you could read but not run the sample code.) One of the nice things when you make a package is that you can count on having irises or mpg there for the example code without adding your own. And it is good for packages that can to use the core data sets, because packages that ship their own data put even more datasets floating around. I wonder if one could run an analysis on CRAN of how many of those datasets are actually used in at least one example. I have a feeling a lot could be safely removed. I was actually setting up some documentation for people teaching with R at my place, and I was really shocked by the number (and to some extent the weirdness) of core datasets.

          I find running sample code very helpful, the text in a lot of function documentation, especially the older ones, is sometimes not very clear.

        • Why couldn’t the sample code snippets add just a one-liner that explicitly loads the dataset they need? Isn’t that more transparent?

        • +1. In the vignette for an R package I wrote, every code snippet has all of the “Requires” for that code snippet, rather than just at the top. That way if someone copies and pastes the code snippet, which, face it, is what most users actually do, the code will work or give a warning.

          There is always a give and take of users learning to work with the program and the program learning to work with users.

        • That’s awesome, did you write that up as an approach? I think everyone would agree that’s the way to go. Also make sure to detach at the end.

        • They could, but if you look at the current packages they don’t. Most that make their own data also don’t remove the variables they create. It would be an incredibly useful project for someone to take on: document how to make help files with no side effects and then go through CRAN and send patches to everyone. I wonder if that would be a great Summer of Code project, or maybe someone would fund a Gnome outreach project for that.
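
          A minimal sketch of what such a side-effect-free example block might look like (my_summary is a hypothetical function standing in for whatever the help page documents):

          data(attitude, package = "datasets")   # make the data dependency explicit
          my_summary(attitude)                   # the actual example
          rm(attitude)                           # clean up so nothing lingers in the workspace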

        • “It is better to light a single candle than to curse the darkness” – old Chinese proverb.

          Perhaps you should spend the 5 minutes needed to find out about the search path and to understand it. I’ve taught S/R myself and it’s one of the earlier issues I raise – and it pays off all round. It’s key to understanding a great many things about how R works, and in dispelling misconceptions based on how people think it *should* work.

          The datasets package is there for people to use in example code without doing destructive things like

          data(attitude) ## which messes up the user’s global environment potentially killing the user’s own ‘attitude’ data, or
          library(datasets) ## (or require(datasets)) which alters the user’s search path

          A better strategy would be not to have the datasets package pre-loaded and attached but to use

          datasets::attitude

          whenever that attitude dataset is needed in an example. This would load the package but not attach it to the search path. There are two problems with this, though. Firstly, it’s clumsy; but more importantly, it would cause enormous backward-compatibility problems. The :: construction is a relatively recent introduction into R itself.

          Finally I have to say that having a version of “attitude” as it is by default, i.e. visible on the search path, but not in your workspace, should not be a problem, if you know what is going on (and you do need to know that, sadly). Some people even use the slightly paranoid tactic of starting off every script with the brutal

          rm(list = ls()) ## Not for me, but others do…

          which at least guarantees that any version of ‘attitude’ you can still see will NOT be yours. You have a clean slate.

        • Bill:

          I don’t disagree with anything you write. All I can tell you is that I’m not the only one who cuts and pastes lines from a script into the R terminal, and all sorts of wacky things can happen, including the incredible-but-true story of an existing preloaded variable called “attitude” (one of 2338 such objects in R, according to one of the commenters above). Backward compatibility is an issue, I’ll grant you that. In the past, R has dealt with such backward compatibility issues by deprecating things that no longer belong, but, sure, it’s a cost.

        • OK Andrew, just for you, I have a solution to all your pain and frustration. Just copy-and-paste this into an R session

          cat("R_DEFAULT_PACKAGES=methods,utils,grDevices,graphics,stats\n", file = "~/.Renviron", append = TRUE)

          and restart R, and you will never be bothered by that pesky datasets package (on that machine) from that point onwards. Honest.

          You still seem to be saying “R does not work the way many people think it works, so we should change R”. I must respectfully disagree. We must work to enlighten people on how to work with R, not struggle and agitate against it unnecessarily. In this case, it really is unnecessary if you understand how R works. Sorry, but that’s the reality.

        • Bill:

          I’d rather not have to add this line to my R code. But, sure, it’s good to have a solution. I used to add things to my Rprofile file but that created some other problems: (1) my code wouldn’t always work for others that didn’t have that Rprofile file, and (2) every time I updated R, I’d have to manually screw with that Rprofile file. This was all about 10 years ago; maybe (2) has been fixed since then.

          And, again, I agree that if I understood R better I’d have fewer problems with R. But a language is not just for its experts! And I’m pretty sure that the vast majority of R users is much less expert than me. In any case, it’s too bad that R is loaded with all these old datasets from decades-old textbooks, but if they are going to be there forever because of backward compatibility then, yes, I guess we have to warn users that they’re there. Also it’s perhaps good to agitate a little bit here so that designers of new languages like Julia can learn of problems to avoid.

        • Professor Venables,

          I love R. I’m an R power user; I’m familiar with the search-path mechanism and would have diagnosed the attitude variable issue right away. I make my team members use R even though I suspect they’d be happier working in Octave. (First-class functions and closures are worth it!) But there can be no denying the fact that R’s variable search mechanism is a major source of gotchas and bugs for beginners and intermediate users — and to me, that means that by definition there’s a design flaw in its implementation. The silence with which R goes outside of the working environment to find variables is the main culprit; a warning thrown by default, plus a mechanism with a simple syntax that the user can use to signal his/her intention to get a variable from outside the working environment (and hence suppress the warning), would go a long way.
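
          As a sketch of roughly that spirit with what R already has today, the intent can at least be made explicit instead of relying on the silent search-path fallback:

          datasets::attitude                                            # explicit package lookup
          get("attitude", envir = as.environment("package:datasets"))   # same thing, spelled out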

        • I’m probably not as much an R power user as Corey, but I am a computing power user in general. For example, I’ve programmed significant amounts of all of the following languages:
          C, C++, Common Lisp, Scheme, Prolog, R, Perl, Python, and probably a few others that I’m forgetting at the moment. And not just toy programs, actual projects that accomplish complex tasks.

          That being said, I feel like R’s scoping system is wonky. I keep expecting it to work like Scheme (lexical), and as I understand it, this was an important goal of the R project as distinct from S, that it be more lexical. But even just the other day I was bitten by something where I had to explicitly call “local” to enforce a local variable, and it just did not seem that this should be required.

          The basic pattern was: generate a random n, then call a function using the value of n, like foo(n), which was returning a closure that was supposed to capture that value of n and give me a function of one variable that I could plot… but every plot wound up using the same n, because the global n was being closed over, not the local n…

          I suspect some of this may be the ggplot2 system, which may be doing some kind of macro thing where the “foo” wasn’t evaluated until the plot was created, except that by using “local” I managed to make it work. In any case, it was weird, it wasn’t anything like lexical, and it wasn’t obvious why.
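
          One common version of that closure surprise, which may or may not be exactly what happened here, comes from R’s lazy evaluation of arguments; a small sketch:

          make_scaler <- function(n) function(x) x * n    # intends to capture this n
          n <- 2
          f <- make_scaler(n)
          n <- 100                                        # changed before f is ever called
          f(1)                                            # 100, not 2: the promise for n is only forced now
          make_scaler2 <- function(n) { force(n); function(x) x * n }   # force() (or local()) pins the value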

        • Also, despite whatever hinks and kinks R has, I started using it in 1999, and have been a huge fan of the project ever since. Primarily the way in which I use it is to write “batch” programs typically with the following structure:

          1) load all the libraries I’ll need
          2) read in datasets off disk, internet, or connect to databases especially SQLite and MySQL databases
          3) munge data into appropriate forms (SQL queries, sqldf statements, regular R subsetting, random sub-sampling etc)
          4) run “statistical” analysis, including generate graphs, smoothing, analysis of subsets, fitting models including Stan models, and otherwise turning the raw data into useful information.
          5) clean up (delete temporary files, upload results into DBs, etc)

          If your R code has this kind of structure, you’re likely to avoid many of the gotchas, since you explicitly create your own data sets early in the process.

      • I’m sorry that many responses here have been rude, Andrew.

        I would say this:

        The “landmine” you stepped on is a legitimate one to discuss. It is important to recognize, though, that the landmine is only triggered by an explicit coding error on your part. In fact, you admit as much in your post:

        “Yes, when I define my own “attitude” variable, the preloaded version in R is hidden…”

        That doesn’t mean it isn’t a problem, but it does significantly color how we should view its severity and our possible remedies.

        A more precise statement of the issue you experienced is that having the datasets package loaded by default means that when a programmer makes a certain specific coding error (referring to an object they haven’t created yet that just by chance happens to have the same name as one of the ~100 data sets in the datasets package), then that programmer’s code will fail _in a way that is more confusing and difficult to debug_ than it ought to be. Normally you’d just get an “object not found” error, which is reasonably clear.

        So in this specific case it’s not that R is actively behaving badly, it’s that in certain, very rare circumstances, it behaves in a way that makes a mistake that a programmer might make more difficult to recover from than we might like.

        I think what a lot of more experienced R folks may have found confusing in what you wrote is that despite saying flat out that you knew defining your own version of attitude would solve the problem completely, you seemed to spend a lot of (frustrating) time doing something unrelated to the solution you already knew about (trying to somehow “remove” the attitude dataset). I mean, removing just attitude wouldn’t do anything to prevent the same issue with other names, so it’s kind of bizarre that you’d even try at that point. But anyone who has written & debugged code knows the sort of crazed frustration that can result, even for the most knowledgeable.

        So my charitable read on this is that you wrote this while still pretty frustrated and angry, without thinking very carefully about it, and the result was some muddled, confusing writing. Whatever, no big deal. We’ve all been there.

        BUT! It’s just not fair for you to write blog posts with the title “R Sucks” where the complaint expressed is a bit muddled and hard to follow and expect everyone to respond seriously and calmly. “R Sucks” is simply not the title of a post putting forward a good faith, serious complaint about R. It’s the title of a rude, immature rant. There’s no defense for the rudeness directed at you personally in some of these comments, but if you want your feedback on R to be treated seriously you should probably reconsider your general tone and language as well.

        “Frustrated with R” or “Confused about R” would both have been fine, and accurate. You’d still get rude responses (this _is_ the internet) but at least the people who matter will be much more likely to take you seriously.

        • Justme:

          Fair enough. What happened was indeed that my code was tangled; I was running different parts of my script out of order (bad practice, I know, and it’s bad practice that I’ve seen from others too!), and I ended up running code that used my “attitude” variable, but in an R session where I’d not computed it. The problem was that my code ran, or seemed to run, because R was pulling up that “attitude” variable from a preloaded dataset, and I had no reason to think that this variable name would be in use. So what we have is a problem in R that activated itself because of my own mistake. But it was frustrating to me because it didn’t need to happen—it was entirely unnecessary for R’s “attitude” variable to be hanging out there in the first place!

          Regarding the title of the post: My first audience for this blog are the regular readers. These are the readers who saw my World Cup post and my climate bet post and my death rates post, the readers who know that I use R all the time, the readers who know that I teach R in all my courses, and who can interpret my titles appropriately. But, sure, I recognize that new readers won’t always get the point.

        • For the record, I really am sympathetic to your complaint that loading the datasets package by default is probably not necessary, even if the problem is triggered by “user error”. Changing that wouldn’t solve this entire category of problem, not by a long shot; the user may still encounter a name collision between an object they thought they had created and a similarly named object in any other package they might have loaded. But it would at least reduce the amount that R is actively contributing to this sort of thing.

          My point in saying that the problem originated with your mistake was not to blame you, but to put in context how commonplace and serious this problem really is, from a design perspective. The frustration you experienced was real, and it was made worse by R.

          I’ve also been a passive observer of R-devel long enough to basically assume that any behavior in R that I find infuriating and can’t imagine how anyone could possibly defend it will have people who actually _depend_ on that behavior and changing it would be a huge pain in the ass for them. I would genuinely be curious to see this batted around on R-devel.

          I’m going to be honest and say that your claim that “regular readers” should be able to “interpret my titles appropriately” is disappointing to me. Maybe this is not quite what you meant, but as a regular reader (for many years) who knows that you use (and like) R, it kind of stinks to be told that the reason I’m annoyed about your choice of title is not because it was unnecessarily inflammatory but because I wasn’t “interpreting it appropriately”.

          A lot of the rudest comments here directed at you were “blaming the user” for something that R was genuinely making confusing. Writing a post entitled “R Sucks” and then saying that annoyed readers aren’t “interpreting my title appropriately” is no different, you’re simply blaming the reader for a perfectly natural mistake.

        • Justme:

          I wasn’t blaming the reader, I was just explaining my process. But, sure, if someone wrote a post called Stan Sucks or BDA Sucks, I’d be pretty annoyed! So I can understand the reaction of some of the more negative commenters here.

        • @justme just because you can work around a poor choice of language design with experience doesn’t mean that it wasn’t a poor design option.

          For there to be such namespace pollution present by default _is_ poor design, although unfortunately it’s far from the worst design decision of the language. So no, to say “R sucks”, tongue in cheek or otherwise, is fair.

    • So why the heck is the “attitude” (or any) dataset attached by default?

      R doesn’t have to work as Andrew expects, but that doesn’t mean it should go out of its way to adopt poor design choices.

  14. I have run into a similar type of problem with Excel. I often have parameter names in a column to the left of the parameter values and use the “create names from selection, using names in left column” feature to name the cells. If a parameter has a name like a1 or ax1, then the name matches actual cells in columns A and AX. Excel does append an underscore to the name, but if you forget about the conflict…

  15. A well-known principle in programming language design is that the language (or compiler, or run-time environment) should detect instances in which a variable or function is used prior to being defined. R violates this. This could be fixed any number of ways (e.g., by warning that attitude is referring to an existing object, by not having “attitude” preloaded, by having a module structure and requiring references external to the module to be syntactically marked, etc.). It is a pity that non-computer scientists design and release languages without having any peer review by programming language designers. It is a pity that computer scientists do not engage (in time) with communities that need new languages. The result is things like matlab, R, and GIS systems that violate established design principles. How many scientific papers are wrong because of programming language design errors?
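
    In the absence of such checks in the language itself, a defensive one-liner in that spirit (a sketch, not an endorsement of any particular style):

    stopifnot(exists("attitude", envir = globalenv(), inherits = FALSE))   # fail fast if we never created it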

  16. Radford Neal has written on this general topic as well:

    Design Flaws in R #1
    https://radfordneal.wordpress.com/2008/08/06/design-flaws-in-r-1-reversing-sequences/

    Design Flaws in R #2
    https://radfordneal.wordpress.com/2008/08/20/design-flaws-in-r-2-%E2%80%94-dropped-dimensions/

    Design Flaws in R #3
    https://radfordneal.wordpress.com/2008/09/21/design-flaws-in-r-3-%E2%80%94-zero-subscripts/

    What makes R great is the diversity of packages available for it, and some notational conveniences. The S language itself (which R implements) is an absolute horror; it’s like Perl for statisticians.

    • Kevin:

      You write, “What makes R great is the diversity of packages available for it, and some notational conveniences.” Also R has a large user base (network effects), it’s open-source (thus, easier to track down bugs), it’s free (that’s a huge advantage, both in itself and also because then there’s no problem including it in classes), and it’s designed by statisticians (so it solves a lot of problems that statisticians want to solve). So, sure, R is annoying, but it has a lot going for it!

      • Yes, this is the core of R’s faults in general: horrid choices of defaults.

        I’ve seen many programming languages in my life, and I have to say S is one of the absolute worst. Many of the abominations that some of the students I teach come up with (those who never programmed in their life and for whom R is the starting point) are due to nonsense such as variables being global by default.

        The fact that R can be so useful and successful despite this starting point is a testimony to the skills of the R creators and community. And also explains why it is perfectly possible to say “R is great” and “R sucks” without being contradictory. It is a pity that some may be so defensive about it, bad design is bad design, RTFM is never an excuse for that.

        • Worst aesthetically, but one of the best for getting things done quickly. Its main competitor SAS is the absolute worst on both counts.

  17. R was released 23 years ago, in 1993 [Wikipedia]. It has not kept up with advances in programming languages. It is therefore, comparatively speaking, unable to support the level of reliable software design and lifecycle to which we have become accustomed. R is full of quirks and has awful performance [1] [2] [3].

    A few examples are in order. Malleability is generic but does not support generic programming. The type system, while based on attributes, is incomplete and inconsistent. Extensibility is flexibility lacking well-conceived support for collaborative agile programming. The presence of a class construct does not ensure consistency or reuse. A plethora of packages simply means none are sufficient. The purported advantages of R are actually the very problems that plague its adherents.

    It is a lot like free beer: it has lots of calories but isn’t particularly nutritious.

    [1] Why has R, despite quirks, been so successful? David Smith 2015
    http://blog.revolutionanalytics.com/2015/06/why-has-r-been-so-successful.html

    [2] Evaluating the Design of the R Language; Floréal Morandat, Brandon Hill, Leo Osvald, Jan Vitek; Purdue University, USA
    http://r.cs.purdue.edu/pub/ecoop12.pdf

    [3] Advanced R.; Hadley Wickham
    http://adv-r.had.co.nz/

    • Bingtop:

      What is your preferred statistical programming language and why? Is it better graphically?

      I ask this because one of the things I like best about R is how easy it is to produce really nice graphs with just a few lines of code, but I am always open to new and better methods. Does your preferred language have a similar capability?

  18. Hello,
    I am a regular user of Stan and I find this post very amusing and useful.
    A. Stan is magic
    B. I have some beef with Stan. Many times, variable names like “t” or “diff” are unusable because Stan has some preprogrammed function or assignment with those names.
    I would like to see a list of all the things we can’t call our parameters or data in Stan.

    Still, your work is great and one can always find a reason to complain.

  19. Andrew, I just thought I’d mention that I’ve just seen an advocate in an institution use the title and a few excerpts of this article to push an anti-R agenda… just FYI

  20. I was writing an example this week for my class and the sample dataset chickwts had the right combination of variable types and number of observations. But I had to add a comment: “Make sure not to accidentally use ChickWeight.”
