Solution to the sample-allocation problem

See this recent post for background.

Here’s the question:

You are designing an experiment to estimate a linear dose-response pattern, where the dose x can take on the values 1, 2, and 3, and the response is continuous. Suppose that there is no systematic error and that the measurement variance is proportional to x. You have 100 people in your experiment. How should you allocate them among the x=1, 2, and 3 conditions to best estimate the dose-response slope?

And here’s the answer: The estimate of the regression slope is simply (xbar_3 – xbar_1)/2 [ummm, not in general, see George’s comment here], and the variance of this estimate is (sigma_1^2/n_1 + sigma_3^2/n_3)/4. There’s no use putting any observations on the x=2 condition, so we can write n_3 = 100 – n_1, and the variance we are trying to minimize is (sigma_1^2/n_1 + sigma_3^2/(100-n_1))/4. Differentiate this with respect to n_1 (ignoring the constant factor of 1/4) and set the derivative to 0:

-sigma_1^2/n_1^2 + sigma_3^2/(100-n_1)^2 = 0.

Thus the optimum occurs when n_1/(100-n_1) = sigma_1/sigma_3.

It says above that the variance is proportional to x, so sigma_1/sigma_3 = 1/sqrt(3), thus n_1/(100-n_1) = 1/sqrt(3).

That is, n_1 and n_3 are in the ratio 1 to sqrt(3).  So n_1 = 100*1/(1+sqrt(3)) and n_3 = 100*sqrt(3)/(1+sqrt(3)), which gives n_1=37 and n_3=63.  (I rounded because you can’t take a fraction of a person.)
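As a quick numerical check, here is a sketch in R (taking the proportionality constant sigma^2 = 1, which doesn’t affect where the minimum falls):

    # Variance of the slope estimate as a function of n_1, with sigma^2 = 1:
    slope_var <- function(n1) (1/n1 + 3/(100 - n1)) / 4
    n1 <- 1:99
    n1[which.min(slope_var(n1))]   # 37
    min(slope_var(n1))             # ~0.0187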

Of my three problems on the exam, this one was the most straightforward.  It’s a problem that can be solved by brute force.  So I was surprised and disappointed that none of the students came even close to the right answer.

I guess it’s a harder problem than I thought it was (or maybe I’m overgeneralizing from the n=4 students who took the exam).

P.S. In comments some people asked why we are assuming linearity; why not estimate a nonlinear relation? The answer is that, with moderate data, it can be difficult to estimate something nonlinear, especially if you’re going to insist on statistical significance. See here for an article from 2000 exploring this point.

54 thoughts on “Solution to the sample-allocation problem”

  1. How realistic is the assumption “measurement variance is proportional to dose” in real-world situations?

    Empirical linear models of this sort seem common in the literature (e.g., dose-response, quantity-activity, concentration-speed, quality-satisfaction, etc.).

    So, should we be mostly only testing at the two extreme doses? Seems very non-intuitive to me.

      • Not to test the assumption of linearity at all seems quite problematic, though. I guess it might be my Social Science background but I’d never just believe a linear relationship just because any kind of theory would predict it.

        • For sure, I would never just believe it. Further, you have two means and you get the line from those means; but if the middle mean happened not to fall perfectly on the line, the whole thing would pivot in the direction of that mean. I mean, why assume no error in those two means? I think it’s pretty problematic to estimate the size of a linear effect from only two means. But maybe that’s just social science-y. I actually have a lab I do with my students where we start with two points and ask them “did the treatment work?” and then add different data and find that the seemingly effective intervention is sometimes an illusion (and the same for the seeming non-effect of another pair of points). I don’t think a dosage study should assume linearity, for the same reason. Assuming linearity and estimating a linear fit are two different things.

          I had some interesting results when I was comparing effect sizes and p values for various combinations including the one Andrew suggests.

        • The problem says that you have 100 data points to allocate.

          If the variance is substantial, that’s not a lot of data to detect nonlinearity with any reliability. We’ve all seen data sets where that would be problematic with so few observations.

          But it’s moot. Andrew specified a linear model, and that’s what the students trying to answer this question should deal with.

      • Yes, but there seem to be tons of studies working on the assumption of linearity and then estimating slopes from a regression. Let’s assume the model is so well established that the linearity itself is not in question.

        Should these studies be only measuring at the two extremes? I’m just saying I don’t see that being done much in practice. So I’m wondering why.

        Mostly people measure at all “doses”. Is that sub-optimal?

      • > The assumption of linearity (and that the linear model is correct) means that the middle observation is irrelevant.

        Yes. One can just look at the expression for regression confidence intervals. From the standpoint of obtaining the tightest intervals you want to make your measurements at the extremes of the range – provided you believe the linear model is correct. Sampling points in between tests the assumption of linearity but it doesn’t tighten confidence intervals.
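        A quick way to see the design effect: compare the multiplier on sigma in the slope’s standard error under different allocations of x (a sketch, ignoring the heteroskedasticity for simplicity):

            # sigma multiplier in the OLS slope standard error: 1/sqrt(Sxx)
            se_factor <- function(x) 1 / sqrt(sum((x - mean(x))^2))
            se_factor(rep(c(1, 3), each = 50))      # 0.100: all points at the extremes
            se_factor(rep(1:3, length.out = 100))   # ~0.122: points spread over all doses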

  2. Here is a modified version of your argument:

    “The estimate of the regression slope is simply (xbar_2 – xbar_1)/1, and the variance of this estimate is (sigma_1^2/n_1 + sigma_2^2/n_2)/1. There’s no use putting any observations on the x=3 condition, so we can write n_2 = 100 – n_1, and the variance we are trying to minimize is (sigma_1^2/n_1 + sigma_2^2/(100-n_1))/1. Differentiate this with respect to n_1 (ignoring the constant factor of 1/1) and set the derivative to 0:

    -sigma_1^2/n_1^2 + sigma_2^2/(100-n_1)^2 = 0.

    Thus the optimum occurs when n_1/(100-n_1) = sigma_1/sigma_2.

    It says above that the variance is proportional to x, so sigma_1/sigma_2 = 1/sqrt(2), thus n_1/(100-n_1) = 1/sqrt(2).

    That is, n_1 and n_2 are in the ratio 1 to sqrt(2). So n_1 = 100*1/(1+sqrt(2)) and n_2 = 100*sqrt(2)/(1+sqrt(2)), which gives n_1=41 and n_2=59. (I rounded because you can’t take a fraction of a person.)”

    It seems that the reason this is inferior to your solution is the variance of this solution is roughly 3 times the variance of your solution. I presume if we instead used doses 2 and 3, the variance would also be greater than your solution. Is there an intuition for why the variance should be lowest if we use 1 and 3, rather than 1 and 2, or 2 and 3? Thanks in advance.
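    One way to see it numerically: the sketch below (design_var is a hypothetical helper) computes the slope variance for each two-point design under its own optimal split n_a : n_b = sqrt(a) : sqrt(b), with sigma^2 = 1.

        # Slope variance for a two-point design at doses a < b, with var(y|x) = x:
        design_var <- function(a, b, n = 100) {
          na <- n * sqrt(a) / (sqrt(a) + sqrt(b))   # optimal count at dose a
          nb <- n - na
          (a / na + b / nb) / (b - a)^2
        }
        design_var(1, 3)   # ~0.019
        design_var(1, 2)   # ~0.058, roughly 3 times larger
        design_var(2, 3)   # ~0.099

    The (b - a)^2 in the denominator is the driver: doubling the dose spread cuts the variance by a factor of four, which more than offsets the larger measurement variance at x=3.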

  3. If instead of just three dose values, one could allocate 100 subjects anywhere on 1 <= x <= 3, would the answer still be 37 & 63 at x=1 & x=3?

    Basically always at the boundaries & in square root proportion?

    • I think that the optimum will be on the boundaries as long as the variance doesn’t change much with the dose and the quadratic term on delta_x in the denominator dominates. But if the variance for x=2 were ten times as big as for x=1, and the variance for x=3 were one hundred times as big, the optimal solution would be to assign 24 patients to x=1 and 76 to x=2. In this case, the increase in precision obtained by making the interval larger doesn’t compensate for the increase in variance at larger doses.

      In any case, I don’t think it’s completely obvious that there is “no use putting any observations on the x=2 condition”, but the following might help. Assume we had patients on x=1 and x=2. Moving the patients from x=2 to x=3 improves the precision because the interval length grows from 1 to 2 and the related coefficient decreases from 1 to 1/4. The increase in the variance of the measure is lower (50%). Assume we had patients on x=2 and x=3. Moving those at x=2 to x=1 is clearly beneficial, as both the increase in the interval and the reduction in the measurement variance have a positive effect. Again, this reasoning is valid only when the change in variance along the axis doesn’t offset the quadratic reduction in the variance as the interval length grows.
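      For what it’s worth, a brute-force check of that 24/76 claim (a sketch, assuming WLS with known variances 1, 10, and 100 at doses 1, 2, and 3):

          # Variance of the WLS slope for a design n = (n1, n2, n3):
          x <- 1:3
          v <- c(1, 10, 100)            # hypothetical variances at each dose
          wls_slope_var <- function(n) {
            w <- n / v                  # total weight contributed at each dose
            Sw <- sum(w); Swx <- sum(w * x); Swxx <- sum(w * x^2)
            Sw / (Sw * Swxx - Swx^2)
          }
          designs <- expand.grid(n1 = 0:100, n2 = 0:100)
          designs <- designs[designs$n1 + designs$n2 <= 100, ]
          designs$n3 <- 100 - designs$n1 - designs$n2
          designs[which.min(apply(designs, 1, wls_slope_var)), ]   # n1=24, n2=76, n3=0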

  4. “a linear dose-response pattern, where the dose x can take on the values 1, 2, and 3” seems to imply that you’d want to test 1, 2, 3. I think that may have been where they got thrown off.

  5. Do intermediate measurements make sense when the measurement variance is not proportional to x but follows some other functional form?

    For both constant and linear variance we know that endpoint-only designs are best, but are there any other credible variance relationships to consider?

    • > Do intermediate measurements make sense when the measurement variance is not proportional to x but follows some other functional form?

      Heteroskedasticity will be the death of me;-)

      If you had a list of available x-values, x_i, (i.e., independent variable values) and you knew what the measurement variance was at each x-value then you just take the derivative of… the determinant of the parameter covariance matrix (yes?) with respect to n_i and then solve the resulting system of equations to determine the optimum values of n_i.

      • > … then you just take the derivative of… the determinant of the parameter covariance matrix

        What I’m thinking is that you’d want to choose n_i values to maximize the determinant or trace of the Fisher information matrix, i.e., the inverse of the parameter covariance matrix (equivalently, to minimize the determinant of the covariance matrix itself). (I omitted “inverse” from my original text. Small difference:-)
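        For what it’s worth, here is a sketch of that criterion for this problem: a hypothetical D-optimal search that maximizes the determinant of the 2x2 information matrix for (intercept, slope), assuming var(y|x) = x.

            x <- 1:3
            v <- x                            # measurement variance at each dose
            fisher_det <- function(n) {
              w <- n / v                      # weight contributed at each dose
              Sw <- sum(w); Swx <- sum(w * x); Swxx <- sum(w * x^2)
              Sw * Swxx - Swx^2               # det of X'WX
            }
            designs <- expand.grid(n1 = 0:100, n2 = 0:100)
            designs <- designs[designs$n1 + designs$n2 <= 100, ]
            designs$n3 <- 100 - designs$n1 - designs$n2
            designs[which.max(apply(designs, 1, fisher_det)), ]   # n1=50, n2=0, n3=50

        Note that this criterion gives a 50/50 split rather than 37/63: D-optimality targets the intercept and slope jointly, while minimizing the variance of the slope alone recovers Andrew’s allocation.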

  6. I agree there’s no point putting any data at x=2, but I don’t think Andrew’s argument justifies it. Specifically, the first line:

    “The estimate of the regression slope is simply (xbar_3 – xbar_1)/2”

    Here’s a counterexample:

    # Two data sets that agree at x=1 and x=3 but differ at x=2:
    x = rep(1:3, times=c(20,40,40))
    y1 = c(1:20, 1:40, 11:50)
    y2 = c(1:20, -1:-40, 11:50)
    coef(lm(y1~x))   # the two slope estimates differ,
    coef(lm(y2~x))   # so the x=2 data matter

    The value of the mean of the observations at x=2 affects the slope estimate. The slope estimate, in general, depends on the data from all three dose conditions.

      • Yeah, your point about it being a harder problem than it looks seems right to me.

        I think the easiest way to get the result is by invoking Gauss-Markov to establish inverse-variance weighted OLS as the best estimator, then optimizing the variance of the resulting slope estimate with respect to the design (governed by two free parameters). Given that the solution lies on the boundary of the space, that optimization would need to be done pretty carefully.

        One might also expect an astute examinee to worry about how well the sample variance could be estimated (as noted by Martyn in the original post’s comments).

        Without Gauss-Markov it’s a four-dimensional optimization problem, where one also considers how to weight the different observations: not easy under exam circumstances. One could, though, rule out any estimator except OLS in the question itself; that would make it easier.

        • Exactly, this is what I was saying in the other thread about how the x=2 mean will leverage the estimate in its direction, relative to the line connecting the mean at x=1 and the mean at x=3. You can’t assume that they will line up precisely even if there is a linear relationship.
          Likewise, putting more observations at one end or the other moves the sample means up or down, and in the presence of three groups (but not two) this leverages the line differently.
          And yes, I think you have to say whether or not you would be allowed to use WLS (or something else) to do the estimating. If you are allowed to correct for putting so many observations where there is more variance, that changes things.

        • If you assume n2 = 0, and you do OLS (y = ax + error) with some fixed n1 and n3 (n1+n3=100) and then minimize the variance of a with respect to n1, do you expect to get the same answer that Andrew gets? I did it quickly and find n3 ~= 66, but haven’t checked this carefully. I don’t see offhand why Andrew’s answer should match OLS, but I may be missing something.

        • By my algebra, even invoking Gauss-Markov and ignoring how well one can estimate the variance, the student still has to minimize:

          (n[1]/1+n[2]/2+n[3]/3)/((1*n[1]+2*n[2]+3*n[3])*(n[1]/1 + n[2]/2 + n[3]/3) – 100^2)

          where n[i] is the number of observations with x=i, so sum(n[1:3])=100. The minimum does indeed lie at (37,0,63), but proving that with paper and pencil – as I assume they have to – seems really quite difficult. And it’s arguably not a good test of whether one should do a PhD in statistics.
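          (A quick numeric check of that claim, as a sketch: grid-search the expression over all integer designs.)

              g <- function(n1, n2, n3)
                (n1/1 + n2/2 + n3/3) /
                  ((1*n1 + 2*n2 + 3*n3) * (n1/1 + n2/2 + n3/3) - 100^2)
              d <- expand.grid(n1 = 0:100, n2 = 0:100)
              d <- d[d$n1 + d$n2 <= 100, ]
              d$n3 <- 100 - d$n1 - d$n2
              d[which.min(with(d, g(n1, n2, n3))), ]   # n1=37, n2=0, n3=63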

          Even if you want to argue directly, and perhaps more intuitively, that the weighted LS slope estimate has variance inversely proportional to the corresponding weighted sample variance of the X’s, you have to convince yourself that this weighted sample variance of the X’s is indeed maximized by putting all the weight on X=1 and X=3. I don’t think that’s obvious – for example, with other patterns of heteroskedasticity the optimum design has some X=2 observations.

          Hope you didn’t fail anyone on account of not answering this question.

        • George:

          I’ve said this a few times in this comment thread but I’ll say it again: I don’t like these exams. But if we are going to have such an exam, I think this is a reasonable question. We do give partial credit, and I do think this sort of problem represents some of the things one can and should learn as a statistician.

        • Thanks. My earlier comment was wrong – I don’t get n3 = 66.

          But I’m still confused about something very basic in this problem. Andrew says that the regression slope is (xbar_3 – xbar_1)/2. Modulo the question of whether it is obvious that n2 should equal zero, Andrew’s expression seems intuitively sensible. But it doesn’t seem to be the slope that OLS with a zero intercept linear model would give (zero intercept because there should be zero effect at zero dose).

          Here is how I interpret the problem in R. I assume a real slope of 1.

          n1 <- 5
          n2 <- 0
          n3 <- 100 - n1

          x1 <- rep(1,n1)
          x3 <- rep(3,n3)

          y1 <- rnorm(n1, 1, 1)         # sd at x=1 is 1
          y3 <- rnorm(n3, 3, sqrt(3))   # variance proportional to x: sd at x=3 is sqrt(3)

          z <- data.frame(x = c(x1,x3), y = c(y1,y3))

          Then "a" is given by

          lm(y ~ x + 0, data = z)$coefficients

          or more simply by

          a <- sum(z$x*z$y)/sum(z$x^2)

          In general, this "a" does not equal (mean(y3) - mean(y1))/2

          as can be verified by simply running the code. If that's the case, and I haven't made a stupid mistake,
          are the optimal n1,n2,n3 what Andrew gets?

          (I think it is likely I am misunderstanding something here, but I am not sure what.)

        • Actually, to reply to myself, it is my assumption that y ~ ax (setting the intercept to be zero) that implies that the optimal solution is to put 100 observations at x = 3. This is why I wasn’t getting the same answer as Andrew. I agree that if instead one assumes y = ax+b, you get Andrew’s solution.
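          (A quick sketch of why: with var(y|x) = x, the WLS estimate of a in y = ax uses weights 1/x, so a-hat = sum(y)/sum(x), with variance 1/sum(x); that variance is minimized by putting every observation at the largest dose. A small simulation with true slope 1:)

              set.seed(123)
              x <- rep(3, 100)
              a_hat <- replicate(5000, {
                y <- rnorm(100, mean = x, sd = sqrt(x))
                sum(y) / sum(x)   # WLS slope for y = ax with weights 1/x
              })
              var(a_hat)          # close to 1/300 ~= 0.0033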

        • One final question on this (I promise), directed to Andrew. Why isn’t the above solution – put all 100 points at x = 3 – a better solution than putting 37 observations at x = 1 and 63 at x=3? If one is going to assume linearity, isn’t it also reasonable to assume that there should be no constant term in the model – that a dose of zero has no effect? The variance of the slope in the model with no intercept is also ~1/3 of the variance with an intercept which also seems desirable.

        • Jeffrey:

          No, the problem said nothing about the line going through 0. To estimate an effect, you need to take the difference to compare to a baseline. See this paper for an example. In general in statistics we have to consider at least two values of x to estimate a slope.

  7. “In comments some people asked why we are assuming linearity; why not estimate a nonlinear relation? The answer is that, with moderate data, it can be difficult to estimate something nonlinear, especially if you’re going to insist on statistical significance.”

    I’m probably misunderstanding this statement, but aren’t we estimating nonlinear relations whenever we use a GLM?

  8. “In comments some people asked why we are assuming linearity; why not estimate a nonlinear relation? The answer is that, with moderate data, it can be difficult to estimate something nonlinear, especially if you’re going to insist on statistical significance.”

    There is a difference between the hypothesis of strict linearity and estimating something linear because of data limitations. If your true underlying function y=y(x) is not strictly linear (but approximately so on [1,3]), then you have to define what that “linear part” you want to approximate is. Maybe you want something like min over (a,b) of Integral_1^3 (y(x)-ax-b)^2 dx, and take a as the characteristic linear trend. Then you might want (depending on what you think about the nonlinear part) to trade some variance for less bias and include some of the x=2 measurements.

    But an exam question is just that. It is not an open research question, and students usually understand that they should show basic competence rather than some zany capacity for deep thinking.

    • I don’t necessarily think people were saying to estimate nonlinearity; it was that there may be nonlinearity in the underlying relationship. So when you estimate with intermediate points you might, as one possibility, get a slope closer to 0.

  9. “…..why we are assuming linearity; why not estimate a nonlinear relation? The answer is that, with moderate data, it can be difficult to estimate something nonlinear,….”

    Sometimes the worry is piecewise-linear behavior: something changes the slope at a threshold.

    Is this case the same as Andrew’s “nonlinear”? Or is it more amenable to estimation, and hence a valid reason for using interior design points, as is often seen in practical studies?

  10. Does “measurement error proportional to x” ever make sense in practice? In most cases, x = 0 is a real option (even if we aren’t collecting that data in this particular experiment), but we would still have measurement error in that case.

    This would result in putting more observations on the smaller values of x — in this case, more observations for x = 1.

    So that’s a bit contrived, but I like the idea of working heteroscedastic errors into this problem.

    • Yes indeed. So I think if you gave a free choice of x here, the optimal solution would be 1 observation at x=0 (then the intercept is known!) and the rest at as large an x as possible.

    • I’m not into medical research at all – could someone give me a hint why measurement variance is bigger at larger doses?
      If the response grows with dose, I would even assume that the relative deviation from the mean is smaller at larger doses: 0.2 units of measurement uncertainty on 5 units of effect is relatively bigger than 0.2 on 20 units of effect. Or did I miss a conceptual difference between measurement precision/uncertainty and variance?

      • Under this model the standard deviation only increases with the square root of the mean. Meaning the ratio of standard deviation to mean *does* still get smaller as the dose increases. The ratio of variance to the mean stays constant, but really when thinking about this ratio you should be thinking about the errors on the same scale as the mean.

        Why would the variance increase in absolute terms at larger doses though? Afraid I don’t know anything about biological models either, but the Poisson distribution is a good example of a model where variance is proportional to the mean, so if you can imagine the dose determining the rate of a Poisson process, it might be a plausible model.

      • Say each dose level involves an extra tablespoon of medication, and each tablespoon has 0.2 units of measurement error. Then a dose of 3 has more error. Or say it is 1, 2, and 3 pills: with three pills there is more chance of accidentally taking 2 or 4, and missing a dose has a bigger impact as well.

        • Matthew: thanks for the reference to the Poisson; some R code illustrating this is at the end.
          Elin: Thanks for the tablespoon example – that does make sense to me.
          Rahul: Since we’re talking “units”, this is a standard deviation that is linearly proportional to dose, I would think. That would indicate your second option is correct…
          But I don’t design experiments like the one in this task, so don’t trust my opinion too much.

          # Poisson pmfs: the variance equals the mean, so the spread grows with lambda
          plot(0:20, dpois(0:20, lambda = 0.5), pch=16)
          points(0:20, dpois(0:20, lambda = 1), pch=16, col=2)
          points(0:20, dpois(0:20, lambda = 2), pch=16, col=3)
          points(0:20, dpois(0:20, lambda = 5), pch=16, col=4)
          points(0:20, dpois(0:20, lambda = 10), pch=16, col=5)

        • Right, so we’d need a different story for variance, but I’m sure someone could come up with it. I’ve certainly heard this idea of variance proportional to mean before, although probably more in textbooks.

  11. I think I would have asked the students about the lack of observations at x=2, something like “why might you, in practice, take some observations at x=2?”, fishing for assessing linearity.

  12. I discussed these questions with my advisor, and he seemed to think that, for this question, the best strategy would be to have all 100 samples at x=3. It took me a bit to realize why he might have said this, but I think it was because a linear dose-response function might not make much sense if the response to a dose of x=0 was anything but 0.

    Then again, neither of us do much drug research (at least, I know I don’t and I don’t think he does much, if any) on our own, so this thought might be a bit off-base. Would there be a good reason for a non-zero response at a dose of x=0?

    • yes, a placebo effect: subjects each take a pill with x=0 grams of the drug in it, but have a non-zero mean response.

      for the rest of the function, subjects taking pills containing higher levels of the drug have mean response that changes linearly with x. This makes sense, no?

      finally, note that “linear” does not mean “goes through the origin”.

      • Yes, exactly. But I can see how a student, reading the question, could think that there is some trick involved. That’s one of the difficulties of exam questions, as compared to real-world problems.

        • Note, though, that by the assumption of the problem, they can’t have a non-zero mean response… they need to have exactly the same non-zero response, since the problem specifies that the response variance at a dose of zero is zero. But if every human being on earth has exactly the same non-zero response whether they are given the drug or not, how is this even a “response”? That’s what I meant when I said the problem implies no placebo effect, except in the incredibly unlikely situation that the placebo effect is exactly the same as the effect of… nothing.

        • Again (I said it in the previous thread): if you let me take just one of my subjects and give them a zero dose, I can easily beat your allocation with the other 99 at dose=3. Granted, that violates the spirit of the question, but it’s a much better solution!

        • I think that what bothers me about this problem is partly the use of dose-response language. The question would have had less discussion if it had been just “you want to model a linear relationship” rather than “you want to do a dose-response study”. Dose-response studies are designed to do things like determine the point at which a drug becomes toxic, or determine the minimum effective dose, or the level of carcinogen which should cause a product to be removed from the market. (This is an interesting older paper about some of the issues raised in the question: http://statistics.ucla.edu/preprints/uclastat-preprint-1993:24, though they assume homoskedasticity.) That is, they are usually conceptually about nonlinear relationships. If they aren’t, you end up with “no safe dose”, or as the EPA says, “Cancer Risk = Exposure x Slope Factor”, i.e. there is always a nonzero risk in the absence of a finding of a threshold: http://www.epa.gov/risk_assessment/dose-response.htm.

          I know Andrew said in a comment that there really is a limitation to 1, 2, 3; that would be the case in a dose-response study if you already knew that 4 was toxic, which is not information that is given in the problem. But let’s presume that we already know that the treatment is efficacious at 1 (so ignore the zero dose) and that anything above 3 would be toxic. Then Andrew’s idea makes sense, because you are only estimating a more or less linear piece of the nonlinear curve between 1 and 3, and using just the data from 1 and 3 would be most efficient and not endanger participants.

        • Elin:

          Perhaps it depends on what applications you work on. Linear dose-response seems like a reasonable model to me; it’s what we used in our radon study. More generally, in the pharmacology and environmental health studies I’ve worked on, it’s always been about estimating the function, never about estimating any discrete threshold such as “the point at which a drug becomes toxic” or “the minimum effective dose.”

          Yes, we do use our inferences in later decision analyses, but in the inferential stage it’s always (for me) the estimation of a continuous function, with the discrete decision making coming later, not in the dose-response model.

          But the applications you’ve worked on may be different, I recognize that.

  13. Count me in as another who thought that the most reasonable interpretation of a linear dose/response model would be a line through the origin. Of course something may happen with zero dose, but it’s not a response to the dose (of what?), because there wasn’t one.

    Hence getting the answer wrong. But hopefully getting a reasonable mark, for providing a correct answer to a slightly different question :-)
