Stupid R Tricks: Random Scope

Andrew and I have been discussing how we’re going to define functions in Stan for defining systems of differential equations; see our evolving ODE design doc. Comments welcome, of course.

About Scope

I mentioned to Andrew I would prefer pure lexical, static scoping, as found in languages like C++ and Java. If you’re not familiar with the alternatives, there’s a nice overview in the Wikipedia article on scope. Let me call out a few passages that will help set the context.

A fundamental distinction in scoping is what “context” means – whether name resolution depends on the location in the source code (lexical scope, static scope, which depends on the lexical context) or depends on the program state when the name is encountered (dynamic scope, which depends on the execution context or calling context). Lexical resolution can be determined at compile time, and is also known as early binding, while dynamic resolution can in general only be determined at run time, and thus is known as late binding.

…scoping rules are crucial in modular programming, so a change in one part of the program does not break an unrelated part.

R, on the Other Hand

R’s scope rules can be quite confusing. First, it lets you define a function with an undefined variable. Start a new R session and do this.

> f <- function(x) { a * x }

> f(3)
Error in f(3) : object 'a' not found

But then we can define a and all is well.

> a <- 10
> f(3)
[1] 30

So clearly the value of a is getting set dynamically at run time. But then what if I do this?

> g <- function(y) { a <- 100; f(y); }

> g(3)
[1] 30

Now even if a had not been defined in the global scope, the call to f(y) in the definition of g would not pick up the definition in the body of g.

I expected given the dynamic nature of the definition that the answer would be 300, not 30. It seems that what’s happening is that the location of the variable a is defined when f is first defined, not when f is used.

But the value is whatever is defined in the global environment at the time the function is called. For example, redefining a produces a new value for f(3):

> a <- 20

> f(3)
[1] 60
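One way to check this reading is to ask R where it thinks f lives. The following is a minimal sketch, assuming it is run at the top level of a fresh session; the names defined_in and same_env are introduced only for illustration. environment(f) reports the environment in which f was defined, and that is where the lookup for a starts, no matter who calls f.

```r
# Record the environment this code runs in (the global environment
# when typed at the prompt of a fresh session).
defined_in <- environment()

f <- function(x) { a * x }
a <- 10

# g has its own local a, but f never looks there.
g <- function(y) { a <- 100; f(y) }

# f's enclosing environment is where f was defined, not where it is called.
same_env <- identical(environment(f), defined_in)
print(same_env)

print(g(3))   # uses a = 10 from f's enclosing environment

a <- 20
print(g(3))   # the lookup happens at call time, so the new value is used
```

So the place to look is fixed when f is created, while the value found there is read each time f runs.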

Stupid R Trick [1]

Now for the stupid R trick.


Update from comments: Peter Meilstrup provided a link to a comment on Christian Robert’s blog by Ross Ihaka, which turns out to be where I first saw this idea.

Suppose I define a new function h as follows.

> b <- 20
> h <- function(x) { if (rbinom(1,1,0.5)) { b <- 1000 }; return(b * x); }

and then call it a few times

> h(2)
[1] 40

> h(2)
[1] 2000

Whether the value of b is the local variable set to 1000 or the global value set to 20 depends on the outcome of the coin flip determined by calling rbinom(1,1,0.5)!
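To make the mechanism easier to see, here is a deterministic variant. It is only a sketch: the function h2 and its argument flip are hypothetical names, and the coin flip is passed in rather than drawn with rbinom. The function also reports whether a local b exists by the time the product is computed.

```r
b <- 20

h2 <- function(x, flip) {
  if (flip) { b <- 1000 }   # creates a local b only on this branch
  # Is there a b in this call's own environment, ignoring enclosing ones?
  has_local_b <- exists("b", envir = environment(), inherits = FALSE)
  list(value = b * x, local_b = has_local_b)
}

heads <- h2(2, TRUE)    # local b shadows the outer one
tails <- h2(2, FALSE)   # no local b, so lookup falls through to the outer b
str(heads)
str(tails)
```

The same expression b * x resolves to two different variables depending on which branch ran.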

Presumably this is the behavior intended by the designers of R's scoping mechanism. But I find it very confusing.

If you want to read more about scoping in R and S, John Fox has a document on CRAN.

Closures

If you really want to understand what's going on in R, you'll have to also read up on closures. Then examples like the following will make sense.

> ff <- function(x) { g <- function(y) { return(x * y) }; return(g) }

> ff(7)(9)
[1] 63

What's going on is that R uses the scope of a variable at the point at which the function is defined and that inner function g is not defined until the function ff is called.
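A short sketch of what that means in practice: each call to ff creates a fresh environment holding its own x, and the returned function carries that environment with it. The names times7, times2, and separate are introduced only for this illustration.

```r
ff <- function(x) { g <- function(y) { return(x * y) }; return(g) }

times7 <- ff(7)
times2 <- ff(2)

# Each call to ff() created a separate environment holding its own x.
x_in_times7 <- get("x", envir = environment(times7))
x_in_times2 <- get("x", envir = environment(times2))
separate <- !identical(environment(times7), environment(times2))

print(c(x_in_times7, x_in_times2))
print(times7(9))
print(times2(9))
```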

The "stupid R trick" is simply based on making the variable's scope non-deterministic.


[1] My apologies to David Letterman and his stupid pet tricks; who knew they'd take over the internet?

33 thoughts on “Stupid R Tricks: Random Scope”

  1. The main issue with both your examples is that you fail to supply arguments for the functions that take *all* the inputs required by the function. Hence you should have:

    f <- function(x, a) { a * x }

    then

    R> f(3)
    Error in a * x : ‘a’ is missing
    R> a f(3)
    Error in a * x : ‘a’ is missing

    The same thing can be done with your h().

    Should R enforce this? If you know what you are doing you can use lexical scoping to your advantage. If you are writing functions in R, you should do as John Chambers (creator of S) suggests you do in a functional language and write functions that take arguments for all things required by the function.

    • Something got dropped from the code chunk. Between the two errors I assigned `a` the value `3` as per the example above. Calling `f()` after this assignment resulted in the same error.

      • Possibly of interest to Bob and others: the `findGlobals` function in the `codetools` package (http://www.inside-r.org/r-doc/codetools/findGlobals) can help ensure that your functions aren’t relying on global variables in unexpected ways.

        If a function is really long, it ends up using a lot of built-in functions like `==` and `[`, so the output isn’t very readable, though.

        I usually look at the intersection of [variables in my global environment] and [variables returned by findGlobals] to see if there are any problems
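The check described above might look like the following sketch. It assumes the codetools package is installed (it ships with standard R distributions); the names here, globals, and suspects are hypothetical. Passing merge = FALSE separates functions from variables, which helps with the readability problem mentioned above.

```r
library(codetools)

here <- environment()   # the workspace this code runs in

f <- function(x) { a * x }
a <- 10

# merge = FALSE separates global functions from global variables,
# which keeps operators such as `*` and `{` out of the variable list.
globals <- findGlobals(f, merge = FALSE)
print(globals$variables)

# Free variables of f that currently exist in the workspace.
suspects <- intersect(globals$variables, ls(here))
print(suspects)
```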

  2. Your first example has R actually more well behaved than your expectations.

    > g <- function(y) { a <- 100; f(y); }
    > g(3)
    [1] 30
    “I expected given the dynamic nature of the definition that the answer would be 300, not 30.”

    Imagine that you have a function buried deep in your code that uses an inner value “temp,” then you update a single line in your code to define a global variable “temp.” With the behavior you seem to be expecting, suddenly your inner function starts overwriting a global value where it previously held a local value.

    Referential transparency demands that “function (x) {a <- 100; x*a}” should always do exactly the same thing no matter where it is written — particularly, whether or not there is a global “a”.

    It sounds like you were expecting MATLAB or CoffeeScript scope. Please don’t make Stan do that.

    The “stupid R trick” example is a well known flaw in R scoping — what R should be doing is making every variable that appears left of a <- start out undefined and local to the function. (That or requiring explicit declarations, but scripting language creators don’t seem to like doing that.)

    • Right. My expectations were set by thinking that the scoping would be dynamic (because the functions are defined dynamically and the lexical scope is defined dynamically). But now I think I understand what R’s doing with lexical scope, though R continues to surprise me by its behaviors. I have no idea what MATLAB does and have never heard of CoffeeScript, so I wasn’t working by analogy to those.

      Is there a reference to this well-known flaw in R’s scoping? I couldn’t find one.

      In terms of functions always doing the same thing, the “stupid R trick” example shows that you can’t tell by inspecting the code where a variable’s scope is — you have to wait for runtime information. The scope gets resolved when the function is actually defined to be the narrowest containing scope where the variable exists, and if none exists, goes with global scopes and hopes the variable’s defined by the time it needs it.

      My preference is to have Stan’s functions behave like C++ function definitions. We’ll see what everyone else wants.

      • I don’t know what you really mean by “dynamic” here. Variable bindings are resolved at runtime, which is one definition. In fact no resolution of variable bindings is done at all when a function is defined, which does not seem to be how you’re conceptualizing it.

        It’d be more accurate to call what R does “deep binding” rather than “lexical scope.” Bindings are dynamic, but lookup proceeds up a stack of contexts associated with each function, rather than the stack of calls (as in “shallow binding”).

        I think there is a point of friction in making analogies between a language where variables are always explicitly declared (C++, etc.) and scripting languages where the assignment operator also doubles as a declaration. If you have to explicitly declare all variables, the issues here go away. If not, you have to contend with “does a=1 assign to the existing a, or create a new local a?” (where CoffeeScript and Matlab make the wrong choice, in my opinion).

        Ross Ihaka, one of R’s creators, has talked about the “random scope” example as something that he’d do differently: http://xianblog.wordpress.com/2010/09/13/simply-start-over-and-build-something-better/

        • Thanks for the reference — I updated the blog post with the citation — that’s where I first saw it myself.

          I am thinking of variable bindings as being done when a function is defined. What’s an example of how it proceeds up a stack of contexts? The contexts are associated with the function when it’s defined, though, not when it’s called, right?

          I agree that the analogies are stretched — that’s what I was trying to say in another comment.

        • Right, contexts (environments) are remembered at each creation of a function, but variable lookup is done as late as possible.

          Environment chaining becomes more visible when you have nested definitions.

          a <- "zero"
          f <- function(def1, def2) {
            if (def1) a <- "one"
            g <- function(def2) {
              if (def2) a <- "two"
              function() a
            }
            g(def2)
          }

          x <- f(TRUE, FALSE)
          x()

          Here “x” gets a function (function () a) which is defined in a call to “g”, which is itself defined within a call to “f”. “a” might or might not be defined locally during each call.

          When you call x(), it looks first for “a” in the context where it was defined (the previous call to “g”), then in its parent (the previous call to “f”), then finally in its grandparent (global). (To be clear, the fact that “a” can be alternately present or absent in an environment depending on data is not a good feature, just used here to illustrate environment chaining.)

          You can explicitly inspect the chain of environments:

          ls(environment(x))
          ls(parent.env(environment(x)))
          ls(parent.env(parent.env(environment(x))))

          Some people are recommending to pass down all arguments to all functions as a general practice. That’s a decent starting practice, but I would argue it turns into a maintenance cost when G takes an argument X and does nothing with it but pass it to F which does nothing with it but pass it to H which finally does something. If you wrap your head around closures you can insulate G and F and H from having to know about X, without making X global.

          (Coming from C++ or Java you can think of using closures the way you might use little helper classes. Arguably, 2/3 of the book “Design Patterns” was about showing how to make classes emulate simple patterns of using closures.)

        • “That’s a decent starting practice, but I would argue it turns into a maintenance cost when G takes an argument X and does nothing with it but pass it to F which does nothing with it but pass it to H which finally does something. If you wrap your head around closures you can insulate G and F and H from having to know about X, without making X global.”

          Not infrequently I have this exact problem, and I would love to get a pointer on how to understand closures so as to avoid it. Maybe a code snippet?

        • Closures are really quite easy to understand. A closure is a function PLUS the environment that it was in when it was instantiated.

          Writing code in this blog is a pain because it swallows > and so forth. Email me with an example question and I’ll email back an example answer. If it seems worthwhile I’ll stick it in a blog post.
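For readers who want a snippet in the meantime, here is one possible sketch of the pattern discussed in this thread. All names (make_h, step_f, step_g) are hypothetical stand-ins for the H, F, and G of the discussion. Instead of threading X through G and F, H is built as a closure that already carries X, and G and F only pass the function along.

```r
# H needs X; G and F only need something to call.
make_h <- function(X) {
  force(X)                        # evaluate X now rather than lazily
  function(value) value * X       # X is captured in the closure's environment
}

step_f <- function(value, h) h(value + 1)          # knows nothing about X
step_g <- function(value, h) step_f(value * 2, h)  # knows nothing about X

h <- make_h(X = 10)
result <- step_g(3, h)
print(result)
```

If X later needs to change type or gain a sibling parameter, only make_h and its caller are touched; step_g and step_f stay as they are.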

  3. R is lexically scoped. In your first example, you can look at the function and know that the value of a comes from the global environment. If your function g() could set the value of ‘a’ used by f(), that would be dynamic scoping. The behaviour of R is like the use of static variables in C++ or Java.

    • R’s lexically scoped, but not in the standard way where you can inspect the source and tell where a variable is resolved. That was the point of this particular “stupid R trick”. In R, you have to wait until run time when the function definition is invoked and the scopes of the containing environments are resolved. Then you know where it’ll scope to.

      I wouldn’t say that’s like C++ or Java, but it’s just an analogy. In C++ or Java, if you have a variable inside a conditional block, it’s local to that block. In R, it escapes to the containing environment as the example shows.

      As far as I know, neither C++ nor Java admits anything like this particular “stupid R trick” in their function definitions.

      • The problem is that variable scope is a bit leaky in R. Although variables don’t leak out of functions they do leak out of control structures when you might reasonably expect them to go out of scope. So index variables in for loops still exist after the loop has exited, and in Bob’s random example, a variable defined inside a branch of an if/else statement still exists after that branch has executed.

    • There can be any number of ‘environments’ betwixt the one assigned the function and the global environment. The interpreter walks the stack looking for the entity and stops when it finds it. That stopping point isn’t necessarily the local or global environment. Now, whether one wants to/should build code with more than these two environments in existence is another question.

  4. I’ve always wished for something like “use strict” that I could optionally add to the start of an R function to prevent the use of global variables. Or maybe “allow global” that would do the opposite. I always always always always send all parameters explicitly, except (1) on very rare occasions I don’t, but more important (2) I sometimes _accidentally_ use a global variable. The former is not a problem because it’s very rare and I’m aware of it, but the latter can be really bad. For instance, I’ll have a variable called max_val inside a function, and I’ll accidentally write val_max instead. That’s usually no problem — I’ll get an error the first time I call the function — but if I happen to have a variable called val_max in my global environment it can be a disaster. This is one reason I try to keep my top-level workspace clean.

    • I agree. I like strict static typing for the same reason.

      It’s just so tempting in R to work with globals, then you encapsulate something in a script and it leaks when you try to share it with someone else.

      • In my R projects, I’ve found it useful to divide my work into two different mental spaces: proto-packages that contain functions for which all arguments are passed explicitly, and project-specific scripts in which I define global variables and helper functions that might use them. If I want to move a helper function to the proto-package space, I internalize the global variable accesses and then verify that the original helper function and the modified scope-safe function do the same thing.

        I’ve found this to provide a relatively safe and robust way to “give in to the temptation” of using global variables while still enabling worry-free code reuse. (It’s a bit of a PITA if more than one global variable gets changed in the execution of a helper function, since this corresponds to “multiple outputs” for the local-scope version, so I try not to do that.)

    • You could try this sort of thing:

      testfunc <- function() cat("testing: a = ", a, "\n\n")
      environment(testfunc)
      parent.env(environment(testfunc))
      environment(testfunc) <- parent.env(environment(testfunc))

      It’s not real local scope, but at least R won’t be searching the global environment for values of variables local to the function.
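A variation on the same idea, offered as a sketch rather than a recommendation: setting a function's environment to baseenv() means free variables are looked up only in the base package, so an accidental global like the val_max typo described earlier in this thread fails loudly. The function clamp is hypothetical. The catch is that functions from other attached packages are no longer visible inside the function unless they are called with an explicit prefix such as stats::sd.

```r
val_max <- 99   # stray variable in the workspace

# Intended to use the argument max_val, but the body has a typo.
clamp <- function(x, max_val) { min(x, val_max) }

silent <- clamp(120, 100)   # quietly picks up the stray val_max

# Re-home the function so lookups skip the workspace entirely.
environment(clamp) <- baseenv()
loud <- tryCatch(clamp(120, 100), error = function(e) conditionMessage(e))

print(silent)
print(loud)   # an error message naming the missing object
```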

  5. I can’t see any stupidity if you follow Gavin’s advice above and pass through all the arguments used inside the function!

    • Andrea, see my post above. Actually I just had this issue come up a couple of days ago and I should have used it as an example: I was writing a function into which I passed all of my arguments explicitly like a good boy. Inside the function I was looping through data one month at a time, making predictions for something, and storing them with the line

      OutMatrix[iMonth] = predValue

      But guess what? My looping variable wasn’t iMonth, it was imonth… and I had a variable iMonth in my global environment. The error wasn’t hard to find, but in other circumstances it could have been.

      And also: what would be bad about giving me the option to avoid this behavior? The default could remain the way it is, but I would like the _option_ of preventing global variables from being used in a function.

    • The title was a reference to the running segment on The David Letterman Show (an American late-night talk show) called “stupid pet tricks” where people brought on their pets to perform silly tricks. The random scope is a silly trick.

      I think the mistake in R is to let variables escape from conditional scopes — you don’t see that in other languages. If that didn’t happen, you wouldn’t have this problem.

      • Thank you for clarifying the meaning of the title.

        I see two kinds of code, “high quality code” and “ad hoc code”.

        When I write functions belonging to the first kind of code, I know I have to take care to pass arguments to my function, to avoid using undeclared (@Phil: or misspelled) variables inside it, etc.

        The other case is when I write ad hoc functions, that is, in isolated situations where “I know what is going on”. Then I do not care about arguments, or even about return values. So in the most extreme case I write (note the “<<-” sign):

        f <- function() a <<- a+1
        a <- 1
        f()

        It is fine for me, that both cases "are allowed" in R.

  6. The direct link is http://obeautifulcode.com/R/How-R-Searches-And-Finds-Stuff/

    And the article is about environments and stack frames. It also confirms that it’s the environment in which a function is defined that determines its scope — that’s the sense in which R is what is known as “lexically scoped” rather than “dynamically scoped.” As I replied in another message above, the underlying issue allowing the “stupid trick” is having variables escape from conditional blocks into the containing block. Without that behavior, you couldn’t get the random scope.

    • I agree here. That local variables escape their block scope is the problem in R. Otherwise R seems properly lexically scoped to me. IF you define something at top level, and it has a free variable, the ONLY place it makes sense to look for it is in the global scope. Most people coming from a C++ type background just don’t get the idea of a closure in the first place and so a “free variable” just seems to be an error.

      In all other respects, R seems to act like Scheme, and my understanding is that R was motivated by starting with S and adding Scheme-like lexical scoping (with the error you mention above apparently falling out of it all).

  7. You are approaching this name scoping issue from the wrong direction. You need to adopt the world view of a currently in fashion programming language guru; the answer will then immediately become obvious.

    Having you social science types get involved in language design is very dangerous. I know what happens next, you will be wanting to run experiments to compare the various possible approaches.

    The R approach is from the 1970s:
    http://shape-of-code.coding-guidelines.com/2013/04/02/push-hard-on-a-problem-here-and-it-just-pop-up-over-there/

    • I think this discussion can be clarified by considering the difference between parent frames and parent environments in R; here’s a nice short example:

      Stack overflow answer about parent frames vs. parent environments in R
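The distinction can also be seen in a few lines. This is a minimal sketch with hypothetical names (where_am_i, caller, marker): parent.env gives the environment in which a function was defined, while parent.frame gives the environment from which it was called.

```r
where_am_i <- function() {
  me <- environment()            # this call's own environment
  list(
    enclosing = parent.env(me),  # where this function was defined
    caller    = parent.frame()   # where this function was called from
  )
}

caller <- function() {
  marker <- "only in caller"
  where_am_i()
}

envs <- caller()

# marker lives in the caller's frame, not in the defining environment.
in_caller    <- exists("marker", envir = envs$caller, inherits = FALSE)
in_enclosing <- exists("marker", envir = envs$enclosing, inherits = FALSE)
print(c(in_caller = in_caller, in_enclosing = in_enclosing))
```

Variable lookup in R follows the enclosing chain, which is why a local variable in the caller is invisible to the function being called.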

      The other thing that will help someone understand what’s going on in R is an understanding of closures, as I said in the body of the article. The Wikipedia article has a simple description and an example from Python:

      Wikipedia article on closures

      Here’s how to translate the Python example in the Wikipedia to R, which combines ideas from the two links:

      > ctr <- function() { 
        x <- 0; 
        incr <- function(y) { 
          ctr_body <- parent.env(environment()); 
          ctr_body$x <- ctr_body$x + y; 
          print(x) }; 
        return(incr); 
      }
      > ctr1_incr <- ctr()
      > ctr1_incr(1)
      [1] 1
      > ctr1_incr(7)
      [1] 8
      > ctr2_incr <- ctr()
      > ctr2_incr(1)
      [1] 1
      > ctr2_incr(7)
      [1] 8
      

      (The returns won’t really work from the command prompt in R — I just used them because it’s too fat to fit in the body of the blog comment otherwise.)
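For comparison, here is a sketch of the same counter written with R's superassignment operator, which many R programmers would reach for here. It is an alternative to the parent.env version above, not a correction of it; the names first, second, and third are introduced only for illustration.

```r
ctr <- function() {
  x <- 0
  incr <- function(y) {
    x <<- x + y   # assigns to the x in ctr's environment, not a new local x
    x
  }
  incr
}

ctr1_incr <- ctr()
first  <- ctr1_incr(1)
second <- ctr1_incr(7)

ctr2_incr <- ctr()      # a fresh counter with its own x
third <- ctr2_incr(1)

print(c(first, second, third))
```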

      To clarify what I meant in the body of the post, there are two things I find confusing about R’s scoping. And I don’t mean I don’t know how they work, just that it makes code confusing.

      Confusing thing 1) There’s no notion of a declaration of a variable; variables are implicitly declared by being used on the left-hand side of an assignment (and perhaps in other ways; I’m really not that good with R). In languages like C++, declaring a variable and giving it a value are separate steps. For instance,

      double x;


      declares the variable x in the local scope, whereas

      x = 10;


      assigns it a value. Of course, both can be done at the same time, but they shouldn’t be conceptually confused:

      double x = 10;


      So had I written the above R example with

      x <- x + y


      that would be taken to declare (and assign to) a new local variable x inside incr, leaving the x in ctr unchanged.

      Confusing thing 2) Scope leaks out of blocks in control statements. I'm not familiar with the guts of R, so I have no idea why it works this way. Ross Ihaka, one of R's designers, said it was a mistake. So I'll take his word for it that it's not a good idea! It's confusing if you're used to languages like C++.

      Finally, to answer the question raised in your blog post linked above, the reason to dislike global variables is that they violate the principle of encapsulation. Encapsulation is a pleasant property because it allows you to reason about how a piece of code will behave by only looking at its local definition. When you have globals, you have to worry about what other code might do to it, which means you potentially have to look everywhere in the code.

      P.S. For what it's worth, I'm a computer scientist by training, not a social scientist.

  8. Thanks, useful discussion.
    My wife was taking a Coursera R course and had a problem that I think was akin to this, or at least some kind of scoping issue, which is always troublesome when the rules are either unfamiliar or hard to figure out.

    To some extent, this reminds me of the 1960s Algol “call by name” arguments.

Comments are closed.