I was trying to make some new graphs using 5-year-old R code and I got all these problems because I was reading in files with variable names such as “co.fipsid” and now R is automatically changing them to “co_fipsid”. Or maybe the names had underbars all along, and the old R had changed them into dots. Whatever. I understand that backward compatibility can be hard to maintain, but this is just annoying.
I think R’s behavior for handling special chars in column names has always been suboptimal. Often the first thing I do after reading a dataset is to sanitize the names in a way that I like. The gsub function can be helpful here.
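A minimal sketch of that kind of clean-up, assuming the convention you want is lower case with underscores. The helper name `clean_names` and the sample column names are made up for illustration; it does not handle names that begin with a digit.

```r
# Hypothetical helper: lower-case the names and collapse every run of
# non-alphanumeric characters into a single underscore.
clean_names <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)  # dots, spaces, dashes, ... become "_"
  gsub("^_+|_+$", "", x)           # drop leading/trailing underscores
}

# Made-up data with inconsistently capitalized, dotted names
df <- read.table(text = "Co.FipsID State..Name
1 3
10 11", header = TRUE)

names(df) <- clean_names(names(df))
names(df)
## [1] "co_fipsid"  "state_name"
```

Because the names are set by your own function, the rest of the script no longer depends on what read.table() decides to do with them.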
I’m not able to fully reproduce this behavior using the latest version of R (beta).
Can you share some code to reproduce the error?
df <- read.table(text = "A.b C.d
1 3
10 11",
header = TRUE)
str(df)
## 'data.frame': 2 obs. of 2 variables:
## $ A.b: int 1 10
## $ C.d: int 3 11
sessionInfo()
## R version 3.0.1 Patched (2013-06-28 r63090)
## Platform: x86_64-unknown-linux-gnu (64-bit)
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
## attached base packages:
## [1] stats graphics grDevices utils datasets methods
## [7] base
## other attached packages:
## [1] XML_3.98-1.1 stringr_0.6.2 reshape2_1.2.2 plyr_1.8
## [5] MASS_7.3-26
## loaded via a namespace (and not attached):
## [1] compiler_3.0.1 tools_3.0.1
I cannot reproduce this behavior either with the current version of R. I believe it was because an old version of R changed the underscore to a dot, but more recent versions of R can respect the underscores now.
> df
> df
A.b C.d
1 1 3
2 10 11
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
I’m pretty sure that Andrew’s problem is that he has column headers with underscores which *used* to get converted to dots by make.names() (called from read.table()) but no longer do. In fact, ?make.names shows that there is an “allow_” argument (“for compatibility with R prior to 1.9.0”):
> make.names("a_b")
[1] "a_b"
> make.names("a_b", allow_ = FALSE)
[1] "a.b"
So following x <- read.table(…) with setNames(x,make.names(names(x),allow_=FALSE)) should
reproduce the old behavio(u)r.
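Spelled out on a toy file, the suggestion above might look like this. The data and the object names `x` and `x_old` are invented for the example; only read.table(), make.names() and setNames() come from base R.

```r
# Toy input with underscores in the header (made-up data)
x <- read.table(text = "first_col second_col
12 34", header = TRUE)

names(x)  # current R keeps the underscores
## [1] "first_col"  "second_col"

# Re-mangle the names the way R prior to 1.9.0 would have
x_old <- setNames(x, make.names(names(x), allow_ = FALSE))

names(x_old)
## [1] "first.col"  "second.col"
```

This leaves `x` untouched, so old code can use `x_old` while anything new keeps the names as they appear in the file.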
This seems to do the trick:
library("Defaults")
setDefaults(make.names,allow_=FALSE)
read.table(text=
"first_col second_col
12 34
",header=TRUE)
## first.col second.col
## 12 34
I could never get used to dots in variable names. I favor the underbar/underscore.
Bob convinced me that underscore is better. But when I have old code, I’d like to just be able to run it.
Why is underscore better? It’s annoying for me, because ESS in Emacs always converts it to <- unless I quote it first. I prefer the dot. And cleaning up names myself makes the code easier for me to maintain when I'm collaborating with folks who change variable names on me and capitalize inconsistently.
ESS will include an underscore as underscore (instead of <-) if you just hit underscore twice. Much easier than C-q _.
I’m a much faster typist than ESS gives me credit for: I can type <- basically three times as fast as I can recover from typing _ and having it display <- and having my brain say "hit delete, something went wrong".
How can I turn off this *feature* in ESS that has always annoyed me!!!
(ess-toggle-underscore nil) in your .emacs file, I think.
Excellent thanks. I used to just think it was mildly annoying, but ever since starting to use ggplot2 with all its “geom_line” and “coord_cartesian” type function names etc it’s become a huge annoyance.
`_` was not allowed in R names a long time ago, as it was synonymous with the assignment operator `<-`. R is not doing any such conversion today (modified from Ahmadou's comment):
> df df
A.b C_d
1 1 3
2 10 11
This change was long requested by UseRs before R Core relented. The problem is the cleaning up that goes on when R reads in data to data frames – you have to turn off a lot of things if you want to read the file verbatim, but then you need to work harder as variable names may not be usable in R without quoting them.
I now routinely read in data and then clean up the variable names by explicitly setting them. That way even if R’s sugar changes I have set the variable names myself so the rest of the code will continue to work. This assumes the underlying data file doesn’t change but I have a personal rule not to change raw data files.
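A sketch of that habit, assuming a two-column file like the ones above. The replacement names in `expected` are arbitrary; the length check is one possible safeguard against the file layout changing, not something R requires.

```r
# Read the raw file; whatever names R produces here are about to be replaced
raw <- read.table(text = "A.b C_d
1 3
10 11", header = TRUE)

# Names chosen by the analyst (hypothetical), independent of R's defaults
expected <- c("a_b", "c_d")

# Fail loudly if the file no longer has the number of columns we expect
stopifnot(ncol(raw) == length(expected))

names(raw) <- expected
str(raw)
```

Everything downstream refers only to `a_b` and `c_d`, so a change in how R cleans up headers cannot break it.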
Something went wrong with the code snippet I posted – stripping of some tags perhaps? – anyway, this is what it should have been
> df df
A.b C_d
1 1 3
2 10 11
Brute force fix: Download a 5 year old R version and use it!
Rahul:
Good idea. I have to admit, though, that the backward compatibility problem is mine as well as R’s. In addition to the problem with the dots and underscores, I also had some code out of order, and some data files were in the wrong directories. I eventually did get it to work (using R 3.0), though.
Andrew’s problem derives from the absence of commercial purpose in the R project (which is not to say that there is anything wrong with the design of the R project). Any commercially-oriented solution would strive to include a system to preserve the execution of legacy code: see, e.g., the version command in Stata. The R project gets its user base from many qualifiers like “state of the art” and “addictive like crack”; “version resilient” is not one of them.
A simple fix is Rahul’s fix. A less simple fix would be to bundle all versions of R together into a single software package, and to allow the user to set the version of the interpreter. I might be underestimating the difficulty of that task, beyond the size of the resulting package. But if some people are developing faster versions of R, it should be possible to develop a mere collage of all R interpreters, right?
This has several implications for teachers too, since teaching R is teaching something that is moving freely, quickly and in many directions (e.g. knitr, ggplot2, Shiny, rCharts, Rcpp, RHadoop). The size and anarchic nature of the package ecology is one thing that you end up teaching to students who are used to much more hierarchical representations of social order. It also connects well with other learning themes in open access statistics.
It’s a tough tradeoff. It’s hard to maintain backward compatibility and add features or speed etc. at all times. A common software engineering problem.
Python has a similar break in compatibility. In clusters I’ve administered (Linux) it was fairly common to maintain several legacy distributions of code for this very reason.
A setenv / alias or similar got your R or Python to point to whatever version you wanted to use.
I’m not sure if the “bundle all packages” is a neat solution. That causes bloat and bug fixes etc. get harder.
One option R should consider is to provide a code translator. Something I can run on legacy code that will transform tricky parts to the new standard automagically.
What you are describing is a fundamental difficulty of software development. Imagine if you had to ensure that Stan code written today would function identically with Stan version 12.1 ten years from now. Your dev team would kill you. That said, there are much better ways of handling versioning and dependency that would alleviate some of the frustration. See http://arxiv.org/pdf/1303.2140.pdf for one set of proposals.
Hard to know without a reproducible example, but you probably want check.names = FALSE.
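For anyone unsure what check.names does, here is a small comparison on made-up comma-separated input. The header strings are invented to be deliberately invalid as R names.

```r
# Made-up input whose headers contain a space and a dash
txt <- "co fipsid,pop-2010
1,3
10,11"

default  <- read.table(text = txt, header = TRUE, sep = ",")
verbatim <- read.table(text = txt, header = TRUE, sep = ",",
                       check.names = FALSE)

names(default)   # passed through make.names()
## [1] "co.fipsid" "pop.2010"

names(verbatim)  # kept exactly as in the file
## [1] "co fipsid" "pop-2010"

# Non-syntactic names have to be quoted when used
verbatim[["pop-2010"]]
verbatim$`co fipsid`
```

The price of reading the header verbatim is the quoting in the last two lines, which is why some people still prefer to rename the columns afterwards.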
I think your code must be a wee bit older than 5 years. If I understood the problem correctly, your file had variable names with underscores (e.g., co_fipsid). In R >= 1.9.0, underscores in names are valid, so mangling no longer takes place. Current R is preserving the original variable names, which is a feature that everyone wants.
I’m sorry that this breaks your code, but I have to point out that R 1.9.0 was released in April 2004, which would make your code more like 10 years old. This is frankly too old for you to be justified in the use of such inflammatory language when it doesn’t work.
Martyn:
No, it really was 5-year-old code.
Could you have been still using R 1.9.0 in 2004?? Maybe on some old out-of-the-way machine? It would take considerable archaeology to prove it, but it seems as though Martyn is probably right about the timing of the change …
from svn log base/R/character.R :
r30004 | ripley | 2004-06-24 05:13:12 -0400 (Thu, 24 Jun 2004) | 5 lines
1) data.frame(check.names=TRUE) enforces unique names
2) make.names has a new arg allow_
3) remove USE_UNDERSCORE_IN_NAMES
4) attempt to warn if newdata is not used in prediction, and document better
r27076 | ripley | 2003-11-15 05:40:36 -0500 (Sat, 15 Nov 2003) | 3 lines
prepare to allow _ in names
move detection of keywords in make.names to C code
oops, I mean “still using 1.9.0 in 2008”
More detail:
https://github.com/wch/r-source/commit/136d7a08f1847bfdf17ccc36d8b139d68ffb11d7#src/library/base/R/character.R
The worst backwards-compatibility problem I’ve encountered with R is reading in a saved workspace. If you are missing a critical package that’s used by something in the image, it won’t load. You’re stuck as far as I can tell. (At least in base R.)
This invariably bites me when a new version of R comes out and I jump to it a bit too soon (before some obscure package I can’t remember using is updated).
CamelCaseCanBeUseful
CamelCase GivesMeAHeadAche!
coming from “real” coding languages, of the C lineage in particular, the use of ‘.’ as anonymous character remains irritating, particularly as R adopts the cloak of OO semantics/syntax. the ‘.’ is the member identifier. period.
Pingback: Michael Crotty
I would love to read this comment but the link http://sww.sas.com/blogs/microt/index.php?/archives/52-The-importance-of-scripting-compatibility.html seems to be broken. Has anyone been able to find it?