Question 1 of my final exam for Design and Analysis of Sample Surveys

1. Suppose that, in a survey of 1000 people in a state, 400 say they voted in a recent primary election. Actually, though, the voter turnout was only 30%. Give an estimate of the probability that a nonvoter will falsely state that he or she voted. (Assume that all voters honestly report that they voted.)

P.S. The commenters are picking up some of the unintended “Hare and pineapple” ambiguity in my question!

25 thoughts on “Question 1 of my final exam for Design and Analysis of Sample Surveys”

  1. (I know this will be wrong but)

    If it should be 300 people saying they voted, then 100 people are lying. 100 people lying out of 400 is 25%. But, 100 people out of 1000 is 10%. So I think it would be p=0.1?

  2. Would the MLE get maximum credit, or do you also want error bars? How would you mark a Bayesian approach with an overly strong prior? (If you allow students to use whatever prior they like they can abuse this to game the question.)

  3. At the risk of looking foolish for misunderstanding or being tired or whatever, here’s my attempt at an answer:

    Out of 1000 people surveyed, given that 300 are expected to have actually voted and 400 said they voted, we expect that 100 said they voted but did not. So p(did not vote | said vote) = 100/400.

    The question asks for p(said vote | did not vote), which by Bayes’ theorem is p(did not vote | said vote) * p(said vote) / p(did not vote)

    this works out to 100/400 * 400/1000 / (1-0.3) = 14% or so
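The Bayes' theorem arithmetic in this comment can be checked with a short sketch. The function name and its parameters are made up for illustration, and it assumes, as the exam question does, that every voter reports honestly and that sampling variation is ignored.

```python
def p_said_given_nonvoter(n=1000, said_yes=400, turnout=0.3):
    """Estimate P(said voted | did not vote) via Bayes' theorem.

    Assumes all actual voters say they voted and ignores sampling error.
    """
    expected_voters = n * turnout                  # 300 expected true voters
    false_yes = said_yes - expected_voters         # 100 "yes" answers from nonvoters
    p_nonvoter_given_said = false_yes / said_yes   # 100/400
    p_said = said_yes / n                          # 400/1000
    p_nonvoter = 1 - turnout                       # 0.7
    return p_nonvoter_given_said * p_said / p_nonvoter


if __name__ == "__main__":
    print(round(p_said_given_nonvoter(), 4))
```

The result agrees with the "14% or so" figure, since 0.25 * 0.4 / 0.7 = 1/7.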

  4. Doing this on the fly on a Friday, but…

    If the population proportion is 0.3, and we draw a sample of size n=1000, then the sampling distribution would be a t-distribution with df = infinity (if using a traditional table). In 95 out of every 100 samples, the estimated sample proportion should lie between 0.2716 and 0.3284, or for our single sample of 1000, this implies that the number of “real” yes votes in our sample could range from 272 to 328 simply due to sampling error. If the remainder of the 400 positive reports are false, then that makes for between 72 and 128 liars. If I understand correctly, P will be the estimated number of sampled voters who lied divided by the *actual* number of nonvoters in the population, making that probability range from 72/700 to 128/700, or 0.103 to 0.183, with a mean of 0.143 (1/7).

    I’ll have to think about this a bit more to convince myself fully that 700 would be the correct denominator. Hoping I would pass in your class…

    p.s. Have enjoyed following your blog since mid-March!

    -Jim
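Jim's interval can be reproduced with a minimal sketch using the normal approximation to the sampling distribution of a proportion. The function name is hypothetical, and unlike the comment it does not round the voter counts to whole people, so the endpoints differ slightly in the third decimal place.

```python
import math


def liar_rate_range(n=1000, said_yes=400, turnout=0.3, z=1.96):
    """Range for the nonvoter misreport rate implied by sampling error alone.

    Assumes a simple random sample, honest voters, and 700 expected
    nonvoters as the denominator (the choice made in the comment).
    """
    se = math.sqrt(turnout * (1 - turnout) / n)    # standard error of the proportion
    low_p, high_p = turnout - z * se, turnout + z * se
    nonvoters = n * (1 - turnout)                  # expected nonvoters in the sample
    # More true voters in the sample means fewer false "yes" answers.
    rate_low = (said_yes - n * high_p) / nonvoters
    rate_high = (said_yes - n * low_p) / nonvoters
    return (low_p, high_p), (rate_low, rate_high)


if __name__ == "__main__":
    proportions, rates = liar_rate_range()
    print(proportions, rates)
```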

  5. Would need to know how many people did not answer the survey question or would need to know if the survey respondent was forced to answer.

  6. Possibly none. It depends on how much of the differential is due to nonresponse or undercoverage among nonvoters. It’s a mistake to assume that the entire difference consists of people lying.

  7. There is quite obviously a lot of information missing.
    Can I choose any prior of my choice? As konrad indicates, if my prior is that people never ever lie, they will not have lied on the survey and the survey result will just be a tail event in the probability distribution of voting outcomes.
    Furthermore, the fact that 30% of people voted says nothing about an individual’s probability of voting so that it is impossible for the student to compute the probability distribution of survey outcomes under the assumption that participants answer honestly. People’s characteristics will affect their individual probabilities. Maybe 60% of the people in the state are registered voters and these people vote with a probability of 50% while the other 40% didn’t bother to register or they are illegal immigrants, so that they will vote with a probability of 0%?

  8. Here is a solution which does not directly use Bayes’ rule, as some of the above solutions do.

    Let y=1 if the person reports voting, 0 otherwise. Let v=1 if they actually voted, 0 otherwise. Then P(y=1) = 0.4, P(v=1) = 0.3, P(y=1 | v=1) = 1, ignoring sampling variation. We want to know P(y=1 | v=0). By the law of total probability, P(y=1) = P(y=1 | v=1) P(v=1) + P(y=1 | v=0) P(v=0). Plugging in the numbers and solving for the unknown yields P(y=1 | v=0) = 1/7, which is approximately 14%.
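Written out step by step, the law-of-total-probability argument looks like this, using the same symbols y and v as the comment and still ignoring sampling variation:

```latex
\begin{align*}
P(y=1) &= P(y=1 \mid v=1)\,P(v=1) + P(y=1 \mid v=0)\,P(v=0) \\
0.4 &= 1 \cdot 0.3 + P(y=1 \mid v=0) \cdot 0.7 \\
P(y=1 \mid v=0) &= \frac{0.4 - 0.3}{0.7} = \frac{1}{7} \approx 0.143
\end{align*}
```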

  9. So one approach would likely be to calculate the probability that no one lied and then subtract that from 1.

    We all know Bayes’ theorem: p(NL | d,I) = p(NL | I)*p(d | NL,I)/p(d | I)
    where NL = “no one lied”, L = “someone lied”, d = data

    The number of those who voted can be treated as a sample from a binomial with a known rate of 0.3 (the actual voter turnout rate is known; one could incorporate uncertainty on this parameter, but in the context of the question it is being given as a known quantity). The probability of the data (400 or more responses out of 1000 with a rate of 0.3) given that no one lied is remarkably small. Using R:
    p(d | NL,I) = sum(dbinom(seq(400,1000,1),1000,.3)) = 1.104e-11
    Incidentally, if you found the probability that it was exactly 400 instead of 400 or more, then p(d | NL,I) is about 4e-12
    Assuming a rather trustworthy prior of p(NL | I) = 0.99
    Then p(d | I) = p(d | NL,I)*p(NL | I)+p(d | L,I)*p(L | I) = (1.104e-11)*0.99+(1-1.104e-11)*0.01 = 0.01

    Thus,
    p(NL | d,I) = 0.99*1.104e-11/0.01 = 1.093e-9

    So basically, given the assumption that the survey members were adequately independent random draws from the population, there is no chance that no one lied. This also assumes that the state population is large compared to the survey size, but violating this assumption would only make it worse.

    We could dig deeper and assume that the sample may not have been drawn perfectly at random, so that there is some uncertainty in the voting rate of the population being sampled (say they did a cell phone survey and turnout was slightly better for younger voters). With a very conservative uncertainty on the rate within the sampled population of Beta(3,7), p(d | NL,I) ~ 0.23 (23%). As such, assuming the same strong prior on trustworthiness as above, p(NL | d,I) ~ 96.7%. For a somewhat stronger distribution on the rate (assuming the survey was better handled and so using a Beta(30,70)), p(d | NL,I) drops to ~ 0.023 (2.3%) and p(NL | d,I) ~ 70.0%.

    I hope the above is all done right.
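For readers without R, here is a rough Python equivalent of the `dbinom` sum and the posterior calculation in this comment. It is only a sketch: the function names are invented for illustration, the binomial pmf is computed in log space with `math.lgamma` to avoid underflow, and it copies the commenter's choice of p(d | L,I) = 1 - p(d | NL,I), which is an assumption of that comment rather than a general rule.

```python
import math


def log_binom_pmf(k, n, p):
    """Log of the Binomial(n, p) probability of exactly k successes."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))


def binom_upper_tail(k_min, n, p):
    """P(X >= k_min) for X ~ Binomial(n, p), summed term by term."""
    return sum(math.exp(log_binom_pmf(k, n, p)) for k in range(k_min, n + 1))


def posterior_no_one_lied(likelihood, prior=0.99):
    """Posterior p(NL | d, I) given p(d | NL, I) and a prior p(NL | I).

    Follows the comment in taking p(d | L, I) = 1 - p(d | NL, I).
    """
    evidence = likelihood * prior + (1 - likelihood) * (1 - prior)
    return prior * likelihood / evidence


if __name__ == "__main__":
    tail = binom_upper_tail(400, 1000, 0.3)
    print(tail, posterior_no_one_lied(tail))
```

Passing 0.23 or 0.023 as the likelihood to `posterior_no_one_lied` gives the two Beta-prior posteriors quoted in the comment; the Beta-binomial likelihoods themselves are not recomputed here.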

  10. Assuming we’re getting the questions in chronological order (Andrew did say there would be one question per lecture he gave) then it’s pretty clear that the answer is 1/7 and we’re overthinking it.

    I do think the question of how to partition the “excess” 100 voters into liars and sampling noise is interesting, and I think the above commenters are right that it depends heavily on our prior expectations about people’s tendency to lie. But I think that with a flat prior, the most probable value is still 1/7.

  11. 700 did not vote, 100 of them lied (under the assumption that voters do not lie, which is given). So it is 1/7.

  12. 400 people said they voted but we know from the last assumption that only 300 of them were honest voters, so 400-300 = 100 non-voters that said they voted. The total number of non-voters is 1000-300 (due to 30% turnout) = 700. So the probability that a non-voter falsely stated they voted = 100/700 = 1/7.

  13. Anonymous: “There is quite obviously a lot of information missing”

    That’s why it’s called probability :)

    Just for fun, here is my solution for the 95% confidence interval.
    Using the binomial distribution for each possible number of voters in the sample of 1000 (up to 400 since voters never lie), the probability for each possible number of liars in the sample of 1000 is computed. Then, assuming a flat prior for the fraction of liars, f, the probability for f can be calculated using the beta function for each possible number of liars. For example, if only 10 people voted (P = 7E-136) then the number of liars is 390 out of 990 non-voters. The probability for each f is then the weighted sum for all possible numbers of liars.

    Peak of the distribution is at 0.143 (1/7, as expected)
    approx 95 % interval is [0.102, 0.180]
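The calculation described in this comment can be sketched as a mixture of Beta distributions, one for each possible number of true voters in the sample, weighted by the binomial probability of that number. This is one reading of the description, not the commenter's actual code: the function names, the grid resolution, and the cutoff for negligible weights are all assumptions, and the interval it produces need not match the quoted one to the last digit.

```python
import math


def log_binom_pmf(k, n, p):
    """Log of the Binomial(n, p) probability of exactly k successes."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))


def mixture_posterior(n=1000, said_yes=400, turnout=0.3, grid_size=1000):
    """Density of the misreport rate f on a midpoint grid over (0, 1)."""
    grid = [(i + 0.5) / grid_size for i in range(grid_size)]
    log_f = [math.log(f) for f in grid]
    log_1mf = [math.log(1 - f) for f in grid]
    # Weights for each possible number of true voters v (at most said_yes,
    # because voters never lie), renormalized over that range.
    weights = [math.exp(log_binom_pmf(v, n, turnout)) for v in range(said_yes + 1)]
    total = sum(weights)
    density = [0.0] * grid_size
    for v, w in enumerate(weights):
        w /= total
        if w < 1e-12:          # skip voter counts with negligible probability
            continue
        liars, nonvoters = said_yes - v, n - v
        honest = nonvoters - liars
        # Flat prior on f gives a Beta(liars + 1, honest + 1) posterior.
        log_norm = (math.lgamma(nonvoters + 2) - math.lgamma(liars + 1)
                    - math.lgamma(honest + 1))
        for i in range(grid_size):
            density[i] += w * math.exp(log_norm + liars * log_f[i] + honest * log_1mf[i])
    return grid, density


def summarize(grid, density, level=0.95):
    """Return the grid mode and an equal-tailed interval at the given level."""
    step = grid[1] - grid[0]
    mode = grid[max(range(len(grid)), key=density.__getitem__)]
    tail = (1 - level) / 2
    cum, lo, hi = 0.0, None, None
    for f, d in zip(grid, density):
        cum += d * step
        if lo is None and cum >= tail:
            lo = f
        if hi is None and cum >= 1 - tail:
            hi = f
    return mode, lo, hi


if __name__ == "__main__":
    print(summarize(*mixture_posterior()))
```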

  14. 1/7, if the sample is random with respect to voting behavior… But I guess that would be too easy, so what’s the catch?

  15. Pingback: Question 2 of my final exam for Design and Analysis of Sample Surveys « Statistical Modeling, Causal Inference, and Social Science

  16. Andrew, do you find that these unintended “Hare and Pineapple” elements in test questions can actually stimulate students’ thinking or just lead to lots of student complaints? Maybe both?
