Some questions from our Ph.D. statistics qualifying exam

In the in-class applied statistics qualifying exam, students had 4 hours to do 6 problems. Here were the 3 problems I submitted:

  1. In the helicopter activity, pairs of students design paper “helicopters” and compete to create the copter that takes longest to reach the ground when dropped from a fixed height. The two parameters of the helicopter, a and b, correspond to the length of certain cuts in the paper, parameterized so that each of a and b must be more than 0 and less than 1. In the activity, students are allowed to make 20 test helicopters at design points (a,b) of their choosing. The students measure how long each copter takes to reach the ground, and then they are supposed to fit a simple regression (not hierarchical or even Bayesian) to model this outcome as a function of a and b. Based on this model, they choose the optimal (a,b) and then submit this to the class. Here is the question: Why is it inappropriate for that regression model to be linear?
  2. You are designing an experiment where you are estimating a linear dose-response pattern with a dose x that can take on the values 1, 2, 3, and the response is continuous. Suppose that there is no systematic error and that the measurement variance is proportional to x. You have 100 people in your experiment. How should you allocate them among the x=1, 2, and 3 conditions to best estimate the dose-response slope?
  3. It is sometimes said that the p-value is uniformly distributed if the null hypothesis is true. Give two different reasons why this statement is not in general true. The problem concerns real examples, not just toy examples, so your reasons should not involve degenerate situations such as zero sample size or infinite data values.

You can try to do these at home; also try to guess which of these problems were easy for the students and which were hard. (One of them was solved correctly by all 4 students who took the exam, while another turned out to be so difficult that none of the students got close to the right answer.) I’ll post solutions over the next three days.

63 thoughts on “Some questions from our Ph.D. statistics qualifying exam”

  1. I’d guess (1) was easiest, (2) was next and (3) was the hardest. Unless I’m missing something, I don’t see how (1) would take more than a minute to answer.

  2. (1) Please do not tell me that you put a sneaky Buckingham’s Pi Theorem question on the test. I’m hoping the answer is simply that the solution will always be at the corners of the input domain.

    (2) If you use weighted least squares as your estimator, I got the optimal allocation to be around (37, 0, 63), which seems a little odd.

    (3)
    (i) Discrete test statistics will definitely not have this property.
    (ii) One that comes to mind is a classic regression problem where y = bx + epsilon, where epsilon ~ N(0,1) and we assume that b = 0. For a fixed set of inputs x1, x2, … ,xn, the p-value will be uniform. However, if x1, x2, …, xn are generated from something odd, say a gamma distribution, the p-value will not be uniform.
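
    To illustrate (i): a discrete test statistic can only produce a limited set of p-values, so the p-value cannot be exactly uniform even when the null is true. A minimal illustration (my own example, using an exact binomial test with a fair coin):

    R code:

    # n = 10 tosses of a fair coin, tested against the true null p = 0.5.
    # The test statistic is discrete, so only a handful of distinct p-values can occur.
    pvals <- replicate(10000, binom.test(rbinom(1, 10, 0.5), 10, p = 0.5)$p.value)
    table(round(pvals, 3))   # a few atoms, not a continuum
    hist(pvals)              # visibly non-uniform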

    • Hi Matt,
      3 (i) seems correct, but regarding 3 (ii), is the simulation below what you had in mind?
      X is gamma distributed, b = 0, epsilon ~ N(0,1), and Y ~ bx+eps?
      Seems like for this setup p-values are still uniformly distributed. Just curious if I understood correctly…

      R code:

      pval <- function() {
        x <- rgamma(500, shape = 1, rate = 1)
        y <- 0*x + rnorm(500, 0, 1)          # true slope is zero
        return(summary(lm(y ~ x))[[4]][2, 4])  # [[4]] is the coefficient table; [2,4] is the p-value for x
      }
      hist(replicate(n = 1000, simplify = TRUE, pval()))

    • I got the same solution to (2) as Matt (i.e. allocate to x=1 and x=3 in the proportion 1:sqrt(3)), but then I changed my mind. Nothing in the problem says that the proportionality constant in the variance is known. If you plug the MLE for this constant into the expression for the variance of the slope parameter then it simplifies, and the optimum allocation is different.

      The precision for the slope parameter is proportional to A(x)/H(x) - 1, where A(x) is the arithmetic mean of x over the sample and H(x) is the harmonic mean. This passes the sanity test that if you allocate all individuals to the same x the precision is zero, because the two means are then equal. Intuitively, to maximize A(x)/H(x), I would say you need to allocate all individuals at the extremes x=1 and x=3, leaving none at x=2. If you assume this then you can write the precision as a function of the proportion (p, say) in group 3. It’s a quadratic expression that is easily maximized, at p=0.5 (a quick brute-force check is sketched below).

      This may still be the wrong answer, but I’m convinced question 2 is the hardest one and I probably shouldn’t think of applying for a PhD at Columbia.
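
      Here’s that brute-force check, for anyone who wants to poke at it. To be clear, this only evaluates the A(x)/H(x) - 1 criterion as described above (with the variance proportionality constant treated as unknown and estimated); it may or may not be the intended exam answer.

      R code:

      # Evaluate the A(x)/H(x) - 1 criterion over all integer allocations of 100 people
      # to the doses x = 1, 2, 3.
      crit <- function(n1, n2, n3) {
        x <- c(rep(1, n1), rep(2, n2), rep(3, n3))
        mean(x) * mean(1/x) - 1   # equals A(x)/H(x) - 1
      }
      grid <- expand.grid(n1 = 0:100, n2 = 0:100)
      grid <- subset(grid, n1 + n2 <= 100)
      grid$n3 <- 100 - grid$n1 - grid$n2
      grid$crit <- mapply(crit, grid$n1, grid$n2, grid$n3)
      grid[which.max(grid$crit), ]   # maximized at n1 = 50, n2 = 0, n3 = 50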

  3. I haven’t thought about 2 for more than a few minutes, and I don’t have “standard formulas” memorized in my brain in case this is the kind of thing that appears on stats majors’ homework all the time. But here are the considerations I’d think about to try to derive something:

    a) We’re asked to derive “the slope”, which means that we need to be fitting a line; otherwise the slope is different at different doses. So we have to assume a linear response. This may be obvious to stats guys, but it seems counterintuitive to someone who actually models things for a living: drugs usually do nothing at tiny doses, do something good at medium doses, and poison you at high doses. Linearity only makes sense over a range that is small, but that’s probably what you’d be testing anyway.

    b) Variance is proportional to dose, so we’re going to have a harder time getting good averages for larger doses. Therefore we need more N allocated to the larger doses.

    c) Since we’re fitting a line, dose 2 is “irrelevant” (though in practice I think it’d be wrong to allocate none to dose 2, just because you might have nonlinearity and you won’t be able to detect it without some dose 2 data).

    d) Intuitively then, for the homework-type conditions, you’d allocate equal precision to dose 1 and dose 3. The standard error is like sqrt(V_i/N_i), with V_i = D_i*V where D_i is the dose, so you want sqrt(3/N_3) = sqrt(1/N_1) and N_1 + N_3 = 100; solve for the N’s (the arithmetic is worked out below)…

    how’d I do? :-)
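
    (Working out the arithmetic in (d), taking the equal-standard-error condition there at face value; this is just a check of that algebra, not a claim about the intended answer.)

    R code:

    # sqrt(3/N3) = sqrt(1/N1) implies N3 = 3*N1; with N1 + N3 = 100 that gives N1 = 25.
    N1 <- 100 / (1 + 3)
    N3 <- 100 - N1
    c(N1 = N1, N3 = N3)            # 25 and 75
    c(sqrt(3 / N3), sqrt(1 / N1))  # both 0.2, so the two group standard errors match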

    • A propos of what Nick Menzies said above, I find your answer amusing in that you just about state what I think to be the answer, and then mess it up in your calculations. But I’m holding up my answer. Daniel: take your part (a) seriously and then answer the question…. Or, to quote our host here, Use Your Prior Information!

      • Well, as I said, it was the 2-minute version. My guess now is that you mean that at dose 0 the response has to be 0. I’d need to know more about how the dose and the response were defined before concluding that, but it’s a reasonable assumption. (For example, are we talking about the dose being an “additional dose” above a baseline? Like if we add fluoride doses 1, 2, 3 to water, but the water has some natural fluoride anyway. Also, is the response on a logarithmic scale or something, so 0 doesn’t mean what we think it does?)

        Thinking more about that, the slope we’d get fitting through 0,0 and one other point has error sqrt(DV/N)/D = sqrt(V/ND). Minimizing this means putting all the points on dose 1.

        • By this reasoning, would it matter where you put the samples? As variance is a fixed proportion of the distance from the origin, wouldn’t any allocation get the same result? As an aside — feels a lot like a Poisson model.

        • Calculate the average response R for the dose and divide by the dose to get the slope, so slope = R/D, with D assumed exact. The standard error of the average R is sqrt(DV/N), where V is the proportionality constant for the variance and D is the dose, so the standard error of the slope is sqrt(DV/N)/D = sqrt(V/(ND)), which is minimized when D = 3. That is what I meant above, but I said you should use dose 1 instead (basically because I wasn’t paying enough attention while trying to get kids out the door).

          Doing math in distracted blog comment form is prone to a lot more errors than in other contexts.

          As a check on my reasoning, imagine that we take D out to a really large number. The *variance* grows linearly in D, but sqrt(variance) grows like sqrt(D), so R/D gets more and more accurate as you increase D.

          From simulation: calculate the mean of R/D and compare it to k, the known value; do this 100 times, and take the standard deviation of the ratios.

        • ack, blog ate my R code.

          accuracy <- function(V,k,D){sd(replicate(100,mean(rnorm(100,k*D,sqrt(V*D))/D/k)))}
          > accuracy(1,1,1);
          [1] 0.105113
          > accuracy(1,1,2);
          [1] 0.06951886
          > accuracy(1,1,3);
          [1] 0.0485634
          >

    • Whoops, re-reading the question, it says “linear” right there in the problem statement. It still seems weird to me, because I think of this kind of thing as usually having an “optimal effective dose” (see problem 1, on linear models and finding an optimum).

    • Yeah, I was thinking along the same lines, but why would you make 2 the empty level instead of 3, given that the variance increases with the dose? (Each observation at 2 is therefore a little more informative than an observation at 3.)

  4. For 1, the first thing that comes to me is that the RHS variables are bounded between 0 and 1. After thinking about it a little more, I would add that what you’re measuring is time to fall a certain height. I’m not exactly certain how drag impacts the force diagrams here, but if you assume that at t=0 and h=0 you only have acceleration a=g+D where D is drag, then you get h=(g+D)*t^2. So for a given height, t is a non-linear function of D (drag).

    For 2, I’m not an expert on WLS/GLS, so I don’t know how to quantify the best slope estimate (maybe some way similar to the OLS estimates). I was thinking n_1=60, n_2=30, n_3=10; this way each group contributes the same amount to the overall variance.

    I couldn’t answer 3 off the top of my head. whuber has some comments on stats.stackexchange (http://stats.stackexchange.com/questions/10613/why-are-p-values-uniformly-distributed-under-the-null-hypothesis) that explain why it might not be true, though.

  5. 2 seems the toughest to me (and seems the likeliest to make someone think they understand it better than they actually do).

    You can apply WLS, but don’t you have to figure out which allocation gives the lowest WLS error for the slope estimate? That doesn’t seem that trivial to me but I could be missing something.

      • I worked out the formula for the variance of the WLS slope estimate in terms of the allocation to each dose amount, but it seemed to have so many terms in both the numerator and the denominator that it didn’t seem worth optimizing since I’m not actually taking the exam. The problem does seem quite a bit easier if the model assumes that the line goes through the origin (in that case the variance turns out to be a constant).
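
        For what it’s worth, here is a brute-force version of that calculation under one particular reading of the problem: a regression with an intercept and a known proportionality constant for the variance. That is a sketch of one interpretation only, not necessarily the intended answer, but it does land on the 37/0/63 allocation mentioned near the top of the thread.

        R code:

        # Variance of the WLS slope estimate (model with intercept, weights = 1/x, known
        # variance proportionality constant), over all integer allocations of 100 people.
        slope_var <- function(n1, n2, n3) {
          x <- c(rep(1, n1), rep(2, n2), rep(3, n3))
          if (length(unique(x)) < 2) return(Inf)   # slope not estimable from a single dose level
          w <- 1 / x
          sum(w) / (sum(w) * sum(w * x^2) - sum(w * x)^2)   # up to the unknown constant
        }
        grid <- expand.grid(n1 = 0:100, n2 = 0:100)
        grid <- subset(grid, n1 + n2 <= 100)
        grid$n3 <- 100 - grid$n1 - grid$n2
        grid$v <- mapply(slope_var, grid$n1, grid$n2, grid$n3)
        grid[which.min(grid$v), ]   # minimized at n1 = 37, n2 = 0, n3 = 63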

        • I’m now imagining that is what was expected on the exam, that it goes through (0,0). But I am not familiar enough with the context to understand how a response, which could be anything, is guaranteed to be 0 at no dosage.

  6. 1. A linear model would allow the chosen ‘optimal’ a and b to be below 0 or above 1. A logistic regression or probit would be more appropriate.

    2. Because we are estimating a linear dose response model, we want to be as certain as possible about two of the three possible treatment effects (assuming that the measurement error does not increase as a function of the number of samples taken). Because measurement variance is proportional to x, we will ignore the dose level X=3 entirely (our measure of the response would be the noisiest for x=3). We now need to decide how to divide up the 100 people into the treatments x=1 and x=2. Because measurement variance is proportional to x, we would have 33 people in the x=1 group and 66 people in the x=2 group.

    3. Attenuation bias due to measurement error can make estimates seem less significant than they would be if variables were measured correctly. Also, sample sizes aren’t actually infinite, so the distribution of the p-values will never be completely uniform.

  7. Actually… I think for problem (2), it doesn’t matter where you allocate the people — if you assume that the line needs to go through (0,0) and that the true response is actually linear.

    Here’s some intuition for it. Imagine that at each point the samples you get (i.e. the measured effect) are generated from a normal distribution centered around the true value, with variance proportional to the value.

    Now let’s imagine we took a sample at one of the points. One of the ways to generate a random normal is to draw a random sample from a uniform random variable (on 0-1) and then use the inverse CDF to find where the point is. If you do this assuming that the variance is proportional to the value (i.e. at point 1, var = sigma, and at point 3, var = sigma*3), then the inverse CDF evaluated at points 1, 2, and 3 gives values that all fall along a line, allowing you to interpolate safely between conditions when doing the final linear estimation.

    Essentially we are predicting where that point “would have been” had it been in a different condition.

    This in turn suggests you can do any assignment of individuals to conditions — the end result will be the same.

    We can test this intuition with an example. Assume the underlying function is y = x, and the variance at x = 1 is 1. Here’s some R code that will generate where each point is on each sample.

    unifEstimate = runif(1, 0, 1)
    xAtOne = qnorm(unifEstimate, 1, 1)
    xAtTwo = qnorm(unifEstimate, 2, 2)
    xAtThree = qnorm(unifEstimate, 3, 3)

    We have that xAtOne, xAtTwo, and xAtThree all fall on a line which intersects the origin. This means if we got a value in condition 1, we can essentially say (under the assumptions of linearity and proportional variance) where that value would have been if it had been generated in condition 3.

    You can make this argument analytically in the simple case above, and I expect that a similar argument will work more generally. This leads me to humbly suggest that there is no “best” answer: any distribution of people across the conditions will work equally well.

    • Here’s another way to take a look at this, by explicitly simulating the entire data set and seeing what values you’d have gotten for different allocations. It turns out that the estimated linear models are all the same, as long as the total number of people is held constant.

      We can marginalize over the “randomness” by setting the random seed with set.seed(int); different values of int will lead to different seeds.

      R code:

      estimate = function(groupOne, groupTwo, groupThree) {
        variance = 1
        slope = 1
        set.seed(2)  # Generates the same random numbers each time. Must be an integer.
        individuals = c(rep(1, groupOne), rep(2, groupTwo), rep(3, groupThree))
        responses = c()
        for (ii in 1:length(individuals)) {
          responses[ii] = rnorm(1, slope * individuals[ii], variance * individuals[ii])
        }
        weight = 1 / (individuals)**2  # To get the weights right for unequal sample variances.
        # I'm only mostly sure this is right. :-)
        return(lm(responses ~ 0 + individuals, weights = weight))
      }
      estimate(20, 20, 20)
      estimate(20, 25, 15)
      estimate(15, 25, 20)

    • Agreed, if you assume that the line goes through the origin, then it doesn’t matter how you allocate the people. I was able to derive this using the formula for the variance of the OLS estimate — it comes out to a constant that doesn’t depend on the x-values.

      But why assume that the line goes through the origin?

      • Because it’s a “response.” How can a dose of zero have a response? (Note that zero doesn’t mean a placebo… it means nothing.) Thus, there’s no logical way to get a response from nothing. Suppose 10 percent of the time the disease goes away on its own. That 10 percent needs to be subtracted, because it’s not a response to any level of treatment. As Daniel Lakeland said above, response needs to be measured as an offset.

        • And if you don’t trust that, take one person and don’t treat them. The variance of their response is zero, by assumption. Thus, you cannot have ten percent respond spontaneously, or the variance of zero dose wouldn’t be zero.

      • This is just reiterating Jonathan (the other one).

        There is no dose, so how can there be a response? It seems like any response you measure should be zero. It may be the case that there is some baseline level of response to no treatment (0 dose).

        My guess is that in most cases this baseline level of response is well known, or negligible. If it is well known, then we can just use that as our intercept point, which leads to the same math as assuming it’s at (0, 0).

        If negligible, then the response level at dose 0 will be close to zero, letting us assume that the intercept is at (0, 0).

        I think this problem would be more interesting with a more complex regression, but I certainly can’t find any fault in the elegance of the solution. :-)

  8. I hate variance; it’s a quantity invented by mathematicians who prefer the elegance of not taking a square root. When you calculate the variance of a real-world quantity, the dimensions are all wrong. If someone asks “what’s your height?” you say 170 cm, for example, but if someone asks “how much uncertainty is there in that measurement?” you don’t say 9 cm^2; it makes no sense.

  9. Slope is (y2 - y1) / (x2 - x1), which suggests that the increased variance at dose 3 is offset by the decrease in variance of the slope estimate from the increased “baseline”. I think this is the same thing Andrew Whalen is saying.

  10. A serious question here: How confident are you that such exams have sufficient validity and reliability to determine whether someone is allowed to pursue their chosen profession?

    Certainly, qualifying exams like this are not (and really cannot be) up to the standards we impose on, say, teacher licensure exams.

    • RO:

      I hate the way our department (and others) use these exams. But I’ve never been able to convince people to get rid of them. Meanwhile, we are asked to supply questions for the exam so I do so. If it were up to me, I would make these accept/reject decisions based on students’ course grades, which I think could supply a lot more information.

      • Andrew:
        Instead of throwing out the exam, have you thought about replacing it with something more research-related? A timed test is very similar to the way grades are decided, so you will not get a ton of new information. A test that requires the ability to read a paper, understand it, and talk about future work would demonstrate that a student has the capability to finish a thesis. This may be a way to check whether a student is “qualified” to continue in the program.

      • To offer you a data point from chemical engineering: many departments have started getting rid of qualifying exams in the last decade.

        I’m not sure. I think they still serve a useful purpose, if administered correctly.

        What I found most amusing was @RO’s using teacher licensure exams as the gold standard. Did I miss the irony?

      • Fair question, and I think that the validity of these exams is questionable. However, to the extent that the validity is in the test development process (and ultimately I think that’s pretty much where we have to live), at least with the Praxis series, the process is research-based and inclusive of stakeholders, including both university faculty and groups such as the NEA. Moreover, there is a somewhat-active research agenda based around these exams that can be found here: http://search.ets.org/researcher/query.html?fl0=KW%3A&ty0=p&op0=&tx0=Teacher%20Licensure

        Of course, such a research agenda is probably literally impossible with PhD qualifying exams, which is OK, but the stakes of these exams are so high that I find it deeply troubling.

        • Reminds me of a recent study that showed that, of all the factors empirically correlated with being an effective HS teacher, a master’s degree in Education was the one with zero impact.

        • RO,

          Thanks for your response.

          I’m not convinced that the stakes of Ph.D. qualifying exams are as high as you seem to believe. In my experience, students usually have two shots at them — and sometimes a third try is available on petition. Those who don’t make it after all tries can usually arrange to get a master’s degree, which usually leaves them in a good position for a variety of good jobs. Sure, there’s a blow to the ego — but that can come in lots of ways in life, not just in failing qualifying exams. The vast majority of people are able to recover and get on with meaningful lives.

  11. In 2, the question states that the response is linear, but it doesn’t state that the proportional variance is linear. In fact, it doesn’t even state that the proportional variance is positive. I would also guess it’s a least squares problem, but that sort of math goes above my knowledge level. And no, it should not just be an even split between 1 and 3. There are many possible variance slopes where 1 and 2 are the best.

  12. I think “best” is possibly the important word in 2. There are many unbiased estimators, including OLS, so if that’s all you care about you could say 33, 34, 33, but I think I would preemptively adjust for the variance. I’d say 54, 27, and 18. That’s because I want to give more weight to the better (lower-variance) data, but I still want to use the full range of data, including the possibility of finding that the linear fit is bad. (If you only have 2 doses you can’t tell this.) I got my numbers by solving a + a/2 + a/3 = 100 (the arithmetic is checked below). Similar to John Hall above.
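
    (Checking that arithmetic, for the record; this just evaluates the allocate-in-proportion-to-1/x rule proposed above, not the exam’s intended answer.)

    R code:

    # Allocation proportional to 1/dose: a at x = 1, a/2 at x = 2, a/3 at x = 3, totalling 100.
    a <- 100 / (1 + 1/2 + 1/3)
    round(c(a, a/2, a/3), 1)   # about 54.5, 27.3, 18.2 -- roughly the 54/27/18 above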

  13. Pingback: Solution to the problem on the distribution of p-values - Statistical Modeling, Causal Inference, and Social Science

  14. Pingback: Solution to the helicopter design problem - Statistical Modeling, Causal Inference, and Social Science

  15. Pingback: Solution to the sample-allocation problem - Statistical Modeling, Causal Inference, and Social Science
