Question 1 of my final exam for Design and Analysis of Sample Surveys

Posted on May 11, 2012 4:00 PM by Andrew

1. Suppose that, in a survey of 1000 people in a state, 400 say they voted in a recent primary election. Actually, though, the voter turnout was only 30%. Give an estimate of the probability that a nonvoter will falsely state that he or she voted. (Assume that all voters honestly report that they voted.)

P.S. The commenters are picking up some of the unintended “Hare and pineapple” ambiguity in my question!

25 thoughts on “Question 1 of my final exam for Design and Analysis of Sample Surveys”

Matt on May 11, 2012 at 4:14 pm said:

Estimate is easy. Have to stop and think about getting a confidence interval though…
moss on May 11, 2012 at 4:17 pm said:

(I know this will be wrong but)

If it should be 300 people saying they voted, then 100 people are lying. 100 people lying out of 400 is 25%. But, 100 people out of 1000 is 10%. So I think it would be p=0.1?
Hack on May 11, 2012 at 4:20 pm said:

[0.117, 0.169]
Anonymous on May 11, 2012 at 4:25 pm said:

1/7?
konrad on May 11, 2012 at 4:57 pm said:

Would the MLE get maximum credit, or do you also want error bars? How would you mark a Bayesian approach with an overly strong prior? (If you allow students to use whatever prior they like they can abuse this to game the question.)
Daniel Lakeland on May 11, 2012 at 5:10 pm said:

At the risk of looking foolish for misunderstanding or being tired or whatever, here’s my attempt at an answer:

out of 1000 people surveyed given that 300 are expected to have actually voted and 400 said they voted, then we expect that 100 said they voted but did not. p(did not vote | said vote) = 100/400

The question asks for P(said vote | did not vote), which by Bayes’ theorem is p(did not vote | said vote) * p(said vote) / p(did not vote)

this works out to 100/400 * 400/1000 / (1-0.3) = 14% or so
stats_jim on May 11, 2012 at 6:21 pm said:

Doing this on the fly on a Friday, but…

If the population proportion is 0.3, and we draw a sample of size n=1000, then the sampling distribution would be a t-distribution with df = infinity (if using a traditional table). In 95 out of every 100 samples, the estimated sample proportion should lie between 0.2716 and 0.3284, or for our single sample of 1000, this implies that the number of “real” yes votes in our sample could range from 272 to 328 simply due to sampling error. If the remainder of the 400 positive reports are false, then that makes for between 72 and 128 liars. If I understand correctly, P will be the estimated number of sampled respondents who lied divided by the *actual* number of nonvoters in the population, making that probability range from 72/700 to 128/700, or 0.103 to 0.183, with a mean of 0.143 (1/7).
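
The arithmetic in the paragraph above can be retraced with a short Python sketch. It assumes the usual normal approximation with a 1.96 multiplier and, like the comment, treats 700 as the number of nonvoters in the sample; the variable names are illustrative only.

```python
from math import sqrt

n = 1000          # survey size (from the question)
turnout = 0.30    # actual turnout (from the question)
said_yes = 400    # respondents who reported voting

z = 1.96          # assumed normal multiplier for a 95% interval
se = sqrt(turnout * (1 - turnout) / n)
p_lo, p_hi = turnout - z * se, turnout + z * se

# Plausible range for the number of real voters in the sample
voters_lo, voters_hi = round(p_lo * n), round(p_hi * n)

# Everyone else who said "yes" is counted as a false report
liars_lo, liars_hi = said_yes - voters_hi, said_yes - voters_lo

nonvoters = round(n * (1 - turnout))   # denominator used in the comment
rate_lo, rate_hi = liars_lo / nonvoters, liars_hi / nonvoters

print(f"sample proportion range: {p_lo:.4f} to {p_hi:.4f}")
print(f"false reports: {liars_lo} to {liars_hi}")
print(f"false-report rate: {rate_lo:.3f} to {rate_hi:.3f}")
```

This only propagates sampling error in the number of true voters; it does not add the binomial uncertainty in how many of the roughly 700 nonvoters misreport.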

I’ll have to think about this a bit more to convince myself fully that 700 would be the correct denominator. Hoping I would pass in your class…

p.s. Have enjoyed following your blog since mid-March!

-Jim
Am I missing something? on May 11, 2012 at 6:36 pm said:

1/7?
Kurt on May 11, 2012 at 6:37 pm said:

Would need to know how many people did not answer the survey question or would need to know if the survey respondent was forced to answer.
awm on May 11, 2012 at 7:09 pm said:

Possibly none. It depends on how much of the differential is due to nonresponse or undercoverage among nonvoters. It’s a mistake to assume that the entire difference consists of people lying.
- awm on May 12, 2012 at 10:51 am said:
  
  Jon Krosnick and some others have an ANES working paper where they looked into this very issue and found that for the most part people aren’t lying and that the sorts of people who participate in surveys about elections are disproportionately the sort of people who vote. http://www.electionstudies.org/resources/papers/nes012554.pdf
Anonymous on May 11, 2012 at 7:55 pm said:

There is quite obviously a lot of information missing.
Can I choose any prior I like? As konrad indicates, if my prior is that people never ever lie, then they did not lie on the survey and the survey result is just a tail event in the probability distribution of voting outcomes.
Furthermore, the fact that 30% of people voted says nothing about an individual’s probability of voting so that it is impossible for the student to compute the probability distribution of survey outcomes under the assumption that participants answer honestly. People’s characteristics will affect their individual probabilities. Maybe 60% of the people in the state are registered voters and these people vote with a probability of 50% while the other 40% didn’t bother to register or they are illegal immigrants, so that they will vote with a probability of 0%?
Anonymous on May 11, 2012 at 8:11 pm said:

I like 1/7 from Bayes
Metrics Magician on May 11, 2012 at 9:46 pm said:

Here is a solution which does not directly use Bayes’ rule, as some of the above solutions do.

Let y=1 if the person reports voting, 0 otherwise. Let v=1 if they actually voted, 0 otherwise. Then P(y=1) = 0.4, P(v=1) = 0.3, P(y=1 | v=1) = 1, ignoring sampling variation. We want to know P(y=1 | v=0). By the law of total probability, P(y=1) = P(y=1 | v=1) P(v=1) + P(y=1 | v=0) P(v=0). Plugging in the numbers gives 0.4 = 0.3 + 0.7 P(y=1 | v=0), and solving for the unknown yields P(y=1 | v=0) = 1/7, which is approximately 14%.
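
As a quick check, here is a minimal Python sketch that solves the law-of-total-probability equation with exact fractions and compares it with the Bayes’ rule route used in earlier comments. The variable names are illustrative, and sampling variation is ignored, as in the comment.

```python
from fractions import Fraction

p_report = Fraction(400, 1000)      # P(y=1): share who say they voted
p_vote = Fraction(3, 10)            # P(v=1): actual turnout
p_report_given_vote = Fraction(1)   # voters report honestly (given)

# Law of total probability, solved for P(y=1 | v=0)
p_report_given_novote = (p_report - p_report_given_vote * p_vote) / (1 - p_vote)

# Bayes' rule route: P(v=0 | y=1) * P(y=1) / P(v=0)
p_novote_given_report = 1 - p_report_given_vote * p_vote / p_report
via_bayes = p_novote_given_report * p_report / (1 - p_vote)

print(p_report_given_novote, via_bayes, float(p_report_given_novote))
```

Both routes give the same fraction, which is why the counting argument (100 false reports among 700 nonvoters) and the Bayes argument agree.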
Mike G on May 11, 2012 at 10:57 pm said:

So one approach would likely be to calculate the probability that no one lied and then subtract that from 1.

We all know Bayes’ theorem: p(NL | d,I) = p(NL | I)*p(d | NL,I)/p(d | I)
where NL = “no one lied”, L = “someone lied”, d = data

The number of those who voted can be treated as a sample from a binomial with a known rate of 0.3 (the actual voter turnout rate is known; one could incorporate uncertainty on this parameter, but in the context of the question it is given as a known quantity). The probability of the data (400 or more responses out of 1000 with a rate of 0.3) given that no one lied is remarkably small. Using R:
p(d | NL,I) = sum(dbinom(seq(400,1000,1),1000,.3)) = 1.104e-11
Incidentally, if you found the probability that it was exactly 400 instead of 400 or more, then p(d | NL,I) ≈ 4e-12
Assuming a rather trustworthy prior of p(NL | I) = 0.99
Then p(d | I) = p(d | NL,I)*p(NL | I)+p(d | L,I)*p(L | I) = (1.104e-11)*0.99+(1-1.104e-11)*0.01 = 0.01

Thus,
p(NL | d,I) = 0.99*1.104e-11/0.01 = 1.093e-9

So basically, given the assumption that the survey members were adequately independent random draws from the population, there is essentially no chance that no one lied. This also assumes that the state population is large compared to the survey size, but violating this assumption would only make it worse.
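
For readers without R, the tail probability and posterior above can be recomputed with a small Python sketch. It uses exact integer arithmetic for the Binomial(1000, 0.3) tail and copies the comment’s own choices, including the 0.99 prior and p(d | L,I) = 1 - p(d | NL,I); those are the commenter’s assumptions, not something the question supplies. The function name is hypothetical.

```python
from math import comb

N = 1000               # survey size
P_NUM, P_DEN = 3, 10   # turnout 0.3 written as the fraction 3/10

def binom_tail(k_min):
    """P(X >= k_min) for X ~ Binomial(N, 3/10), summed with exact integers."""
    numer = sum(comb(N, k) * P_NUM**k * (P_DEN - P_NUM)**(N - k)
                for k in range(k_min, N + 1))
    return numer / P_DEN**N

like_nl = binom_tail(400)                                   # p(d | NL, I)
exactly_400 = comb(N, 400) * 3**400 * 7**600 / 10**N        # P(X = 400)

prior_nl = 0.99                  # commenter's prior that no one lied
like_l = 1 - like_nl             # commenter's choice for p(d | L, I)
evidence = like_nl * prior_nl + like_l * (1 - prior_nl)
post_nl = like_nl * prior_nl / evidence                     # p(NL | d, I)

print(f"p(d | NL) = {like_nl:.3e}, P(X = 400) = {exactly_400:.1e}")
print(f"p(NL | d) = {post_nl:.3e}")
```

Summing integers before dividing avoids the floating-point underflow that a term-by-term product of 0.3**k and 0.7**(1000-k) would run into for large k.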

We could dig deeper and assume that the sample may not have been perfectly random, so that there is some uncertainty about the voting rate of the population actually being sampled (say they did a cell phone survey and turnout was slightly better for younger voters). With a very conservative uncertainty on the rate within the sampled population of Beta(3,7), p(d | NL,I) ~ 0.23 (23%). As such, assuming the same strong prior on trustworthiness as above, p(NL | d,I) ~ 96.7%. For a somewhat tighter distribution on the rate (assuming the survey was better handled and so using a Beta(30,70)), p(d | NL,I) drops to ~ 0.023 (2.3%) and p(NL | d,I) ~ 70.0%.
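
One way to reproduce numbers of this kind is a beta-binomial tail: the turnout rate in the sampled population is drawn from the Beta prior and the count of voters is then binomial. The Python sketch below assumes that reading of the comment and again copies its choices of a 0.99 prior and p(d | L,I) = 1 - p(d | NL,I); the function names are hypothetical.

```python
from math import lgamma, exp

def log_beta_fn(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binom_tail(k_min, n, a, b):
    """P(X >= k_min) when rate ~ Beta(a, b) and X | rate ~ Binomial(n, rate)."""
    total = 0.0
    for k in range(k_min, n + 1):
        log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        total += exp(log_choose + log_beta_fn(k + a, n - k + b) - log_beta_fn(a, b))
    return total

def posterior_no_one_lied(like_nl, prior_nl=0.99):
    """Posterior p(NL | d), taking p(d | L) = 1 - p(d | NL) as in the comment."""
    like_l = 1 - like_nl
    return prior_nl * like_nl / (prior_nl * like_nl + (1 - prior_nl) * like_l)

for a, b in [(3, 7), (30, 70)]:
    like = beta_binom_tail(400, 1000, a, b)
    print(f"Beta({a},{b}): p(d | NL) = {like:.3f}, "
          f"p(NL | d) = {posterior_no_one_lied(like):.3f}")
```

Working in log space with `lgamma` keeps the very large binomial coefficients and very small Beta-function ratios from overflowing or underflowing.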

I hope the above is all done right.
David J. Harris on May 12, 2012 at 12:12 am said:

Assuming we’re getting the questions in chronological order (Andrew did say there would be one question per lecture he gave) then it’s pretty clear that the answer is 1/7 and we’re overthinking it.

I do think the question of how to partition the “excess” 100 voters into liars and sampling noise is interesting, and I think the above commenters are right that it depends heavily on our prior expectations about people’s tendency to lie. But I think that with a flat prior, the most probable value is still 1/7.
Mike2 on May 12, 2012 at 2:36 am said:

p = 1/7, 95% CI = .099 to .186
DK on May 12, 2012 at 2:44 am said:

Out of ~700 non-voters, ~100 lied. Hence 1/7. Is this too simple-minded? Do I really need to invoke Bayes?
mandy on May 12, 2012 at 2:54 am said:

700 did not vote, 100 of them lied (under the assumption that voters do not lie, which is given). So it is 1/7.
MCA on May 12, 2012 at 2:57 am said:

400 people said they voted but we know from the last assumption that only 300 of them were honest voters, so 400-300 = 100 non-voters that said they voted. The total number of non-voters is 1000-300 (due to 30% turnout) = 700. So the probability that a non-voter falsely stated they voted = 100/700 = 1/7.
tom campbell-ricketts on May 12, 2012 at 5:21 am said:

Anonymous: “There is quite obviously a lot of information missing”

That’s why it’s called probability :)

Just for fun, here is my solution for the 95% confidence interval.
Using the binomial distribution for each possible number of voters in the sample of 1000 (up to 400 since voters never lie), the probability for each possible number of liars in the sample of 1000 is computed. Then, assuming a flat prior for the fraction of liars, f, the probability for f can be calculated using the beta function for each possible number of liars. For example, if only 10 people voted (P = 7E-136) then the number of liars is 390 out of 990 non-voters. The probability for each f is then the weighted sum for all possible numbers of liars.

Peak of the distribution is at 0.143 (1/7, as expected)
approx 95 % interval is [0.102, 0.180]
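
The procedure described above can be sketched in Python as a mixture of Beta posteriors on a grid. This is one reading of the description, under stated assumptions: each possible number of true voters v (at most 400) is weighted by its Binomial(1000, 0.3) probability, and given v a flat prior on f yields a Beta(liars + 1, honest nonvoters + 1) posterior. The grid size and cutoff are arbitrary choices, and the result need not match the quoted interval digit for digit, since the comment does not spell out every detail.

```python
from math import lgamma, log, exp

N, SAID_YES, TURNOUT = 1000, 400, 0.3
HONEST_NONVOTERS = N - SAID_YES   # the 600 who said they did not vote

def log_binom_pmf(k, n, p):
    """Log of the Binomial(n, p) probability of exactly k successes."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

# Weight for each possible number of true voters v in the sample (v <= 400)
raw = {v: exp(log_binom_pmf(v, N, TURNOUT)) for v in range(SAID_YES + 1)}
total = sum(raw.values())
weights = {v: w / total for v, w in raw.items() if w / total > 1e-12}

def mixture_density(f):
    """Weighted sum of Beta(liars + 1, honest nonvoters + 1) densities at f."""
    dens = 0.0
    for v, w in weights.items():
        a = SAID_YES - v + 1          # liars + 1 (flat prior on f)
        b = HONEST_NONVOTERS + 1
        log_pdf = (lgamma(a + b) - lgamma(a) - lgamma(b)
                   + (a - 1) * log(f) + (b - 1) * log(1 - f))
        dens += w * exp(log_pdf)
    return dens

GRID = 1000
fs = [(i + 0.5) / GRID for i in range(GRID)]   # grid midpoints in (0, 1)
density = [mixture_density(f) for f in fs]

# Normalise on the grid and accumulate a CDF to read off quantiles
norm = sum(density)
cdf, acc = [], 0.0
for d in density:
    acc += d / norm
    cdf.append(acc)

peak = fs[max(range(GRID), key=density.__getitem__)]
lower = next(f for f, c in zip(fs, cdf) if c >= 0.025)
upper = next(f for f, c in zip(fs, cdf) if c >= 0.975)
print(f"peak near {peak:.3f}, central 95% interval about [{lower:.3f}, {upper:.3f}]")
```

The interval here is an equal-tailed one read from the grid CDF; a highest-density interval or a different weighting of v would shift the endpoints slightly.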
Olivia on May 12, 2012 at 5:27 am said:

1/7, if sample or random with respect to voting behavior… But I guess that would be too easy so – What’s the catch?
- Olivia on May 12, 2012 at 5:31 am said:
  
  IS random, not or…
Pingback: Question 2 of my final exam for Design and Analysis of Sample Surveys « Statistical Modeling, Causal Inference, and Social Science
JSB on May 12, 2012 at 4:39 pm said:

Andrew, do you find that these unintended “Hare and Pineapple” elements in test questions can actually stimulate students’ thinking or just lead to lots of student complaints? Maybe both?
