Named variables

I always tell students to give variables descriptive names: for example, define a variable “black” that equals 1 for African-Americans and 0 for others, rather than a variable called “race” whose coding you can’t remember. The problem actually came up in a talk I went to a couple of days ago: a regression included a variable called “sex”, and nobody (including the speaker) knew whether it was coding men or women.
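Here is a minimal sketch of the idea in Python. The records, field names, and category labels are made up for illustration; the only point is that each 0/1 column is named after the category that gets the 1.

```python
# Hypothetical records; the field names and labels are assumptions for illustration.
records = [
    {"id": 1, "sex": "F", "race": "African-American"},
    {"id": 2, "sex": "M", "race": "White"},
    {"id": 3, "sex": "F", "race": "Asian"},
]


def add_indicators(rows):
    """Return copies of the rows with self-describing 0/1 indicator columns."""
    out = []
    for row in rows:
        new_row = dict(row)
        # The column name says what 1 means: female = 1 for women, 0 otherwise.
        new_row["female"] = 1 if row["sex"] == "F" else 0
        # black = 1 for African-Americans, 0 for everyone else.
        new_row["black"] = 1 if row["race"] == "African-American" else 0
        out.append(new_row)
    return out


coded = add_indicators(records)
for row in coded:
    print(row["id"], row["female"], row["black"])
```

With columns named this way, a regression coefficient on "female" or "black" can be read directly as the difference for the named group, with no codebook needed.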

P.S. Yet another example occurred a couple of days later in a different talk (unfortunately I can’t remember the details).

P.P.S. I corrected the coding mistake in the first version of the entry.

P.P.P.S. Check out Keith’s story in the comments.

8 thoughts on “Named variables”

  1. If "black" equals zero for African-Americans, then doesn't the coefficient (say, in an OLS regression), actually describe the impact of being not black? In my mind, it turns "on" when equal to one for non-African-Americans and adds the value of the coefficient to the dependent variable. So I'd expect "black" to equal 1 for African-Americans and zero otherwise. Is there a standard practice?

  2. There is another perspective: not letting the analyst "know" what the codes are until the analysis has been completed.

    As an aside, years ago this saved my job. I was to give a methodological presentation with my clinical boss, but he was unable to attend, and I did not know the coding in the example data set. I informed the audience that, in his absence, I did not know whether "dead" was coded as 0 or 1. This led to a somewhat formal request that I be fired from the hospital for not knowing whether patients were dead or alive. My boss later replied that he preferred that I not know the coding until after the analysis was done.

    Of course, one needs to be aware of the coding and get it right when presenting results to others.

  3. This reminds me of a story I once heard that, unfortunately, seems too funny to be true.

    Someone was presenting a paper to a generalist audience. The author had a variable called sex, coded 0 for female, 1 for male. Someone raised their hand and angrily wondered why men were given a higher score than women. The author replied something along the lines of, “Oh, that's just a dummy.” The audience member only became angrier, thinking she, or all women, were being accused of being "dummies". Since then, the author always names the variable “female” and codes women as 1. Like I said, I can't vouch for the veracity of the anecdote.

  4. Whenever I teach students, 99% of them code male = 1, female = 2. So females get the higher score, but males come first. I agree that a descriptive name for a dummy is best (I always use male or female) but the alternative, when there are more than two categories, is alphabetical order, so female = 1, male = 2. You can still reconstruct lost codes (as long as you know the categories, of course).

  5. I, too, prefer dummy variables with obvious names when the number of distinct values is relatively small. When categorical variables have larger numbers of distinct values, however, there is a trade-off between clarity and parsimony.

  6. To you this may look like an unlikely anecdote. But it will happen to you _whenever_ you present in front of a feminist audience.

    Bookshelves have been written on this topic in the feminist and postmodern literature, saying that dichotomies as such are the devil's work (not in these words). And the male/female coding is their favorite example.

    btw, calling a variable "sex" would be wrong from their position in the first place as it needs to be called "gender". They will remind you of this, I'd bet my car on it.
