Question 12 of my final exam for Design and Analysis of Sample Surveys

12. A researcher fits a regression model predicting some political behavior given predictors for demographics and several measures of economic ideology. The coefficients for the ideology measures are not statistically significant, and the researcher creates a new measure, adding up the ideology questions and creating a common score, and then fits a new regression including the new score and removing the individual ideology questions from the model. Which of the following statements are basically true? (Indicate all that apply.)

(a) If the original ideology measures are close to 100% correlated with each other, there will be essentially no benefit from this approach.

(b) If the original ideology measures are not on a common scale, they should be rescaled before adding them up.

(c) If the original result was not statistically significant, the researcher should stop, so as to avoid data dredging and selection bias.

(d) Another reasonable option would be to perform a factor analysis on the ideology measures and create a common score in that way.
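
For option (b), one common way to put items on a common scale is to standardize each one before adding them up. Here is a minimal sketch, assuming made-up responses and hypothetical helper names (standardize, composite_score). Z-scoring is only one possible rescaling, not necessarily the one the question has in mind.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale one item to mean 0 and standard deviation 1."""
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / s for v in values]

def composite_score(items):
    """Sum the standardized items, respondent by respondent."""
    z_items = [standardize(col) for col in items]
    return [sum(row) for row in zip(*z_items)]

# Hypothetical responses from five people on two ideology items,
# one on a 1-5 scale and one on a 0-100 scale (made-up numbers).
item_a = [1, 2, 3, 4, 5]
item_b = [10, 30, 50, 70, 90]

scores = composite_score([item_a, item_b])
print(scores)
```

Without the rescaling step, the 0-100 item would dominate the sum simply because its numbers are larger.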

Solution to question 11

From yesterday:

11. Here is the result of fitting a logistic regression to Republican vote in the 1972 NES.

Income is on a 1–5 scale. Approximately how much more likely is a person in income category 4 to vote Republican, compared to a person in income category 2? Give an approximate estimate, standard error, and 95% interval.

Solution: On the logit scale, the estimate is 0.66 with se 0.12. The 95% interval is [0.66 +/- 2*0.12] = [0.42,0.90]. To switch to the probability scale, divide by 4 and round down: the estimate is then 0.16 with se 0.03, 95% interval is [0.10,0.22].
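
The arithmetic in the solution can be checked with a short sketch. It uses only the numbers quoted above (0.66 and 0.12); the variable names are illustrative. The divide-by-4 step relies on the fact that the logistic curve has a maximum slope of 1/4, reached where the probability is 0.5, so the result is an upper bound that is close to exact near the middle of the curve.

```python
import math

def inv_logit(x):
    """Map a value on the logit scale to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Numbers taken from the solution above (logit scale).
est_logit = 0.66
se_logit = 0.12

# Approximate 95% interval: estimate +/- 2 standard errors.
lo_logit = est_logit - 2 * se_logit
hi_logit = est_logit + 2 * se_logit

# Divide-by-4 rule: approximate difference on the probability scale.
est_prob = est_logit / 4
se_prob = se_logit / 4
lo_prob = lo_logit / 4
hi_prob = hi_logit / 4

# For comparison, the exact difference in probability,
# ASSUMING the two categories straddle a probability of 0.5.
exact_diff = inv_logit(est_logit / 2) - inv_logit(-est_logit / 2)

print(f"logit scale: {est_logit:.2f} (se {se_logit:.2f}), "
      f"interval [{lo_logit:.2f}, {hi_logit:.2f}]")
print(f"divide by 4: {est_prob:.3f} (se {se_prob:.3f}), "
      f"interval [{lo_prob:.3f}, {hi_prob:.3f}]")
print(f"exact difference centered at 0.5: {exact_diff:.3f}")
```

The divided values are 0.165 and [0.105, 0.225] before rounding down, which is where the 0.16 and [0.10, 0.22] in the solution come from.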

5 thoughts on “Question 12 of my final exam for Design and Analysis of Sample Surveys”

  1. I actually prefer to interpret logistic regression results in terms of odds and odds ratios. This corresponds with the main way logistic regression simplifies/models the data: just as linear regression simplifies the effect of income by assuming that a unit increase in income is associated with an increase by a fixed number regardless of how much income one had to begin with, so logistic regression simplifies the effect of income by assuming that a unit increase in income is associated with an increase in the odds by a fixed factor regardless of how much income one had to begin with. These assumptions can of course be relaxed by adding polynomials, splines, etc., but it shows what the “natural metric” for the effects in the different models is. Staying within the natural metric of the model can be particularly helpful in more complicated models, e.g. models with interactions.

    This leaves the problem that odds and odds ratios have such a bad (and undeserved) reputation for being hard to interpret that most people in the audience/readers no longer even try to understand them. I usually solve that by first describing a relevant baseline odds in substantive terms without using the name odds and then moving on to the odds ratio.

    So in this case my answer would be that for those in income category 2 there is approximately half a person voting Republican for every person voting something else (presumably Democrat), and these odds of voting Republican are almost twice as large in income category 4.

    • This was more or less exactly my answer. I also find odds fairly intuitive to work with. Once you switch to probabilities you fix the baseline – which may or may not be sensible. Odds are thus useful when you don’t know the correct baseline, when it doesn’t exist, or when you are comparing effects in groups where the baselines differ. I once read the distinction characterized as odds being useful for science and probabilities for application.

      The bad reputation of odds ratios seems to stem from medicine, where odds ratios are criticized for being poor approximations to risk ratios. However, this assumes that you want the risk ratio (or that the risk ratio is somehow the correct way to measure effects) and also assumes that you have a meaningful, useful baseline.
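
The odds-ratio reading in the comment thread above can be sketched in the same way. This assumes the 0.66 logit difference and its interval from the solution, plus the commenter's baseline odds of roughly one half for income category 2. The second baseline (odds of 4) is purely hypothetical and is there only to show how the probability difference depends on where you start.

```python
import math

# Logit-scale difference and interval, taken from the solution.
diff_logit = 0.66
lo_logit, hi_logit = 0.42, 0.90

# Exponentiating a logit difference gives an odds ratio.
odds_ratio = math.exp(diff_logit)
or_interval = (math.exp(lo_logit), math.exp(hi_logit))

def odds_to_prob(odds):
    """Convert odds to a probability."""
    return odds / (1.0 + odds)

# Baseline from the comment: about half a Republican voter
# for every other voter in income category 2.
baseline_odds = 0.5
p2 = odds_to_prob(baseline_odds)
p4 = odds_to_prob(baseline_odds * odds_ratio)

# HYPOTHETICAL second baseline, to show that the same odds ratio
# implies a different change in probability.
other_baseline_odds = 4.0
q2 = odds_to_prob(other_baseline_odds)
q4 = odds_to_prob(other_baseline_odds * odds_ratio)

print(f"odds ratio: {odds_ratio:.2f}, "
      f"interval [{or_interval[0]:.2f}, {or_interval[1]:.2f}]")
print(f"baseline odds 0.5: probability {p2:.3f} -> {p4:.3f}")
print(f"baseline odds 4.0: probability {q2:.3f} -> {q4:.3f}")
```

The odds ratio is the same in both cases, while the change in probability is noticeably smaller at the higher baseline, which is the point about fixing a baseline made in the reply.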

  2. There’s a lot going on in question 11. I’m guessing there’s a low % getting it completely correct, if only 50% got a much easier early question right (question 1, IIRC).

