The data and your statistical model can’t tell you whether there are time trends or, if there are, why they are occurring. You need to go outside statistics, to thinking and hypothesizing.

“But if you do a very careful study (so as to minimize variation) or a very large study (to get that magic 1/sqrt(n)), you’ll get a small enough confidence interval to have high certainty about the sign of the effect. So, in going from high sigma and low n to low sigma and high n, you’ve been “adding time or sample to an experiment” and you “found a result.””

Doing a “very careful study (so as to minimize variation)” again, involves thinking about the problem in a qualitative way and introducing controls based on theory. The “very careful” part is theory, not data, driven. This is a VERY different and more effective way to reduce uncertainty than increasing sample size, which I don’t believe will generally reduce uncertainty to acceptable levels in dirty data. It is the epidemiological approach vs the experimental approach. Stopping rules tell you where to stop in the former case; that’s not good enough.

http://aidsperspective.net/blog/?p=749

Sounds like the affirming-the-consequent fallacy, which I’ve noticed is rampant throughout medical research (if the drug works, people who get it will survive longer; people who got the drug survived longer; therefore the drug works). I’ve seen comments by Kary Mullis (who invented PCR) regarding the early days of HIV testing, saying the method at the time was not capable of detecting virus at the levels claimed. It’s not my area of expertise so I will stop there, but I would not doubt that fields could continue along the wrong path for decades under the current environment of mass confusion over how to interpret evidence, along with publication bias.

Also this paper contains a nice discussion of stopping rules:

“It is an interesting sub-paradox that the seemingly hard-headed and objective p-value approach leads to something as subjective as the conclusion that the meaning of the data depends not only on the data but also on the number of times the investigator looked at them before he stopped, while the seemingly fuzzy subjective formulation leads to the hard headed conclusion “data are data”.”

Cornfield, Jerome (1976). “Recent Methodological Contributions to Clinical Trials”. American Journal of Epidemiology 104 (4): 408–421.

http://www.epidemiology.ch/history/PDF%20bg/Cornfield%20J%201976%20recent%20methodological%20contributions.pdf

He compares three Bayesian stopping rules in addition to an NHST-based one. Stopping based on Bayes factors and accepting/excluding the ROPE introduces bias in the parameter estimate. Stopping based on precision seems to be the way to go. A great quote is: “A stopping rule based on getting extreme values will automatically bias the sample toward extreme estimates.”
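To make that quote concrete, here is a rough simulation sketch. It is not the analysis from the paper: as a stand-in for "stopping on extreme values" it uses a simple z-statistic threshold with known standard deviation, and the true mean, sample-size limits, and precision target are all assumed values picked for illustration.

```python
import random
import statistics

def run_trial(rng, true_mean, should_stop, min_n=10, max_n=200):
    """Draw N(true_mean, 1) observations one at a time until the rule says stop."""
    total, n = 0.0, 0
    while n < max_n:
        total += rng.gauss(true_mean, 1.0)
        n += 1
        if n >= min_n and should_stop(n, total / n):
            break
    return total / n  # the estimate reported at stopping time

def stop_on_extreme(n, mean, z_crit=1.96):
    # Assumed stand-in for an "extreme value" rule: stop once |z| exceeds z_crit.
    return abs(mean) * n ** 0.5 > z_crit

def stop_on_precision(n, mean, target_se=0.1):
    # Stop once the standard error (sd assumed known and equal to 1) hits the
    # target; the running mean is deliberately ignored.
    return 1.0 / n ** 0.5 <= target_se

def average_estimate(should_stop, true_mean=0.2, n_sims=2000, seed=1):
    rng = random.Random(seed)
    return statistics.fmean(
        run_trial(rng, true_mean, should_stop) for _ in range(n_sims)
    )

avg_extreme = average_estimate(stop_on_extreme)
avg_precision = average_estimate(stop_on_precision)
print("true mean: 0.2")
print("average estimate, stop on extreme value:", round(avg_extreme, 3))
print("average estimate, stop on precision:    ", round(avg_precision, 3))
```

In this setup the precision rule never looks at the running mean, so the average estimate should sit near the true value, while the extreme-value rule tends to stop early exactly when the running mean happens to be large.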

Cheers,

Jonas

Sure, but that utility makes no sense. Ultimately, the utility should make sense in terms of dollars and lives, or quality of life, or whatever. There is no logic to the utility function that you gave.

U(th, ¬C) = 0 for all th

U(th, C) = -95 if th < 0

U(th, C) = +5 if th > 0

Then the expected utility EU[¬C] = 0, while EU[C] = P(th<0) * -95 + (1-P(th<0)) * 5, which will be greater than EU[¬C] IFF P(th>0) > 0.95
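A quick numerical check of that threshold, as a minimal sketch. The function names are made up for illustration; the -95 and +5 payoffs are the ones from the step utility above.

```python
def expected_utility_treat(p_positive, loss=-95.0, gain=5.0):
    """EU[C] when P(th > 0) = p_positive, under the step utility above."""
    return (1.0 - p_positive) * loss + p_positive * gain

def should_treat(p_positive):
    # EU[not C] is 0 for every th, so choose C only when EU[C] is positive.
    return expected_utility_treat(p_positive) > 0.0

for p in (0.90, 0.96, 0.99):
    print(p, round(expected_utility_treat(p), 2), should_treat(p))
```

Solving (1 - p) * (-95) + p * 5 > 0 gives 100p > 95, so the break-even point is p = 0.95, which matches the rule stated above.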

We have a whole chapter on decision analysis in BDA, so I certainly don’t mind the idea of utility analysis. I don’t think there is a true utility function but I think utility functions are useful for clarifying tradeoffs in decision problems. We discuss further in Section 9.5 of BDA. Also, I don’t mind inferring utility functions from data, I just think you have to be clear that preferences involve a lot more than utilities.

Do you have a different view of utility functions in general now? Or is it that you don’t like working backward from, say, choice data to infer something about unobserved utility functions, whereas in full Bayesian decision making, you get to make your own utility function and then use it prescriptively?

If it’s the latter, what’s the problem with inferring utility functions from data?

No, that doesn’t work. You need a utility that is a joint function of your decision and the unknown theta. The utility function you gave just depends on theta and thus does not imply any decision recommendation at all.

If my utility were the step function that is -95 when th is less than 0 and +5 otherwise, then my Bayesian decision rule would be to use the treatment iff P(th>0) > 0.95, which is thus affected by the stopping rule. Even if it’s an unrealistic utility, isn’t that cause for concern?

I do think that formalizing costs and benefits can be a good idea. There’s no general way that I know to do this; I think that at our current stage of understanding it just needs to be done anew for each problem. One advantage of formalizing costs and benefits is that it can make you think harder about what you’re really concerned about in your estimation problem. That said, I don’t usually do this sort of formal decision analysis in my own work.

Maybe this is in your new book, but I haven’t finished reading it yet.

In Bayesian decision making, there is a utility function and you choose the decision with highest expected utility. Making a decision based on statistical significance does not correspond to any utility function.

And using an HPD yields pretty similar decisions. I wouldn’t even know what the alternative criterion for a decision would be in this very practical setting.

[Of course, I don’t yet know whether I did the homework right!]

Sanborn, A. N. & Hills, T. T. (in press). The frequentist implications of optional stopping on Bayesian hypothesis tests. Psychonomic Bulletin & Review. http://www2.warwick.ac.uk/fac/sci/psych/people/asanborn/asanborn/frequentist_implications.pdf

Rouder, J. N. (in press). Optional Stopping: No Problem For Bayesians. Psychonomic Bulletin & Review.

http://pcl.missouri.edu/sites/default/files/r_0.pdf

Cheers,

E.J.

Nicely put

We then divide this set [of possible results] by a system of ordered boundaries…such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined on the Information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts.

In the case of optional stopping, the set of possible results is any *observed difference & N combo* that stops the experiment, so the “system of ordered boundaries” needs to be set up for all possible N. Once you’ve defined such a test procedure T, you can observe the experimental result and then back out the Type I error rate that puts the result on the boundary of the rejection region: that’s your p-value.

…Or you could always just chuck the observed difference, pretend you only observed N, and base your test and p-value on that. That’s pretty much what frequentists who wanted actual results had to do back in the Stone Age (that is, back when computations were chiselled on stone tablets, or written in notebooks in pen, or whatever it is people did back then).

I assume you are talking about the case where one is doing a Bayesian analysis. In a frequentist setting, that *would* be foolish, no? I just want to have it out there so that I don’t start getting people telling me Andrew Gelman says it’s OK to keep running an experiment till you hit significance ;)

What I started doing very recently in such a situation (where more data would help) is to re-run the experiment and use the previous data as a prior. I hope that’s not too crazy.
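For what it's worth, in a simple conjugate setting this is just ordinary sequential updating. The sketch below assumes binary outcomes with a Beta prior, and the counts are invented for illustration; it also assumes both runs share the same underlying success probability (no time trend between them).

```python
def update_beta(alpha, beta, successes, failures):
    """Conjugate Beta-binomial update: add the observed counts to the prior."""
    return alpha + successes, beta + failures

prior = (1.0, 1.0)  # flat Beta(1, 1) prior, an assumption for this sketch

# Hypothetical first run: 7 successes, 13 failures.
after_first = update_beta(*prior, successes=7, failures=13)

# Hypothetical re-run: 12 successes, 8 failures, with the first posterior as prior.
after_second = update_beta(*after_first, successes=12, failures=8)

# Analysing all 40 observations in a single pass from the original prior.
pooled = update_beta(*prior, successes=19, failures=21)

print(after_second, pooled)
```

Both routes land on the same posterior, so using the earlier posterior as the new prior is equivalent to pooling the data, provided the two runs really are exchangeable.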

It’s all about cost. No point in spending lots of $ on extra data if they’re not needed. Conversely, if there is a lot of uncertainty it can make sense to gather more data. It would be pretty foolish to just sit there with your prechosen N, if you think you can learn something useful by increasing your sample.

Indeed, there are a lot of people like you, which is one reason that Bayesians probably should be worrying more about the impact of stopping rules on inferences.

Just to add a bit more about this “fairness” thing (maybe we need to do a joint blog post on it…): It seems reasonable for a basketball game to have a symmetry principle, that any stopping or scoring rule has to be symmetric relative to the team labeling, for example if you stop after team A is up by 20 points, then you have to stop after team B is up by 20 points. For a medical trial, though, I don’t see this, as I’d think it would be rare that an analysis is symmetric in any case. (For example, the existing treatment and the new treatment are typically not treated symmetrically.)

There are some differences between the basketball story and the scientist story. In basketball you have a winner, in science you are doing inference. For example, if this were a science example and you had 2 drugs and you kept sampling until drug A wins . . . this isn’t such a realistic rule, because if A is much worse, it’s quite possible you’ll never (in finite time) get to a point where A wins. Especially if you have a rule that N has to be greater than some minimum value such as 40. Also, if A wins and the difference is clearly noise (e.g., 8/20 successes for A and 7/20 successes for B), that won’t be taken as strong evidence.

So to apply your story to science, you’d need to have a minimum sample size, a maximum sample size, and some rule that you only stop if A is statistically significantly better than B. Even so, yes, you will sometimes see that happen, and a data-dependent stopping rule can increase the probability of stopping at that point—but, yes, this is a frequentist argument and indeed I don’t think it will hurt a full Bayesian analysis if there is no underlying time trend in the probabilities of success.

As I said above, though, a data-dependent stopping rule could cause damage if someone is mixing Bayesian inference with non-Bayesian decision rules. And indeed people do this all the time, I’m sure (for example, performing a Bayesian analysis and then making a decision based on whether the 95% posterior interval excludes zero). So in that sense it could create a problem.

To go back to the basketball example: in a Bayesian analysis, your posterior probability of which team is better is changing a bit with each score. But sports is about winning, not about inference: a team wins the game if they scored more points, not because there is an inference (Bayesian or otherwise) that they are the better team.

Perhaps this last point will be clear if I return to the sample size analogy. Suppose two players are competing: now consider an individual sport, in this case taking shots 30 feet from the basket, and the ref gives 1 shot to player A and 1000 shots to player B. The prize goes to the player with a greater success record.

It’s really hard to make the shot from 30 feet, so player A will almost certainly get 0 successes. But player B gets so many tries, he’ll probably have some success, maybe 10% or 5% or whatever. The point is, Player B will almost certainly win. So you get unfairness, but with no data-dependent stopping rule. The problem is with the decision rule. Having a decision rule that satisfies certain fairness properties is a hard problem. It’s true that by restricting the stopping rules in certain problems, you can get the fairness properties that seem so intuitive, but you lose something too (in the medical example, you might give a less preferred treatment to someone). Is it worth it, this tradeoff? It depends on how much you care about the fairness property. It’s hard for me to see the justification of it, really; I think it’s an Arrow’s-theorem-like situation where there are certain properties that intuitively seem desirable but, on second thought, aren’t worth the effort.

Discussions of stopping rules and inference usually begin with the frequentist position that the repeated sampling principle is more important than any other consideration, so I am very pleased to read this post.

It is commonplace to view data-dependent stopping rules as problematical because of the increase in risk of type I errors, but at the same time as the risk of false positive errors increases, the risk of false negative errors declines. I’ve played around with simulations and in almost all situations the false negative rate declines much faster with increasing sample size than the false positive rate increases. Thus even within the inferentially depleted world of frequentists who use dichotomous outcomes there are inferential advantages to data-dependent stopping rules. Does anyone know why such stopping rules are nearly universally assumed to be deleterious?
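This is not the simulation I ran, just a minimal sketch of the kind of comparison I mean. It assumes a one-sample z-test with known standard deviation 1, an effect size of 0.4 under the alternative, a fixed design with N = 20, and an optional-stopping design that tests after every observation from N = 20 up to N = 60. All of those numbers are arbitrary choices for illustration.

```python
import random

def rejects(rng, true_mean, min_n, max_n, z_crit=1.96):
    """True if the z-test rejects at any look between min_n and max_n."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(true_mean, 1.0)
        # With sd = 1, z = mean * sqrt(n) = total / sqrt(n).
        if n >= min_n and abs(total) / n ** 0.5 > z_crit:
            return True
    return False

def rejection_rate(true_mean, min_n, max_n, n_sims=2000, seed=0):
    rng = random.Random(seed)
    hits = sum(rejects(rng, true_mean, min_n, max_n) for _ in range(n_sims))
    return hits / n_sims

# Fixed design: a single look at N = 20 (min_n == max_n).
fp_fixed = rejection_rate(0.0, 20, 20)
fn_fixed = 1.0 - rejection_rate(0.4, 20, 20)

# Optional stopping: keep sampling up to N = 60, stop at first significance.
fp_optional = rejection_rate(0.0, 20, 60)
fn_optional = 1.0 - rejection_rate(0.4, 20, 60)

print("false positive rate: fixed", fp_fixed, "optional", fp_optional)
print("false negative rate: fixed", fn_fixed, "optional", fn_optional)
```

One caveat on the comparison: the optional-stopping design is allowed to use up to three times as many observations as the fixed one, so part of the drop in false negatives is simply the larger sample.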

If the game is called unexpectedly because the electricity went out to the gym, it seems fair to take the score at the time of the power failure as good data. But if the referee calls the game as soon as his favored team pulls into the lead, that hardly seems fair.

How is the basketball story different from the scientist collecting data until he gets the result he wants?

I don’t think I understand point 2: suppose we interpret p as “under H0 the probability of this event occurring within N observations is less than p”; then wouldn’t we calculate the same p-value however N was chosen (whether predetermined or by a stopping rule)?

(& I came up with another qualification in a Bayesian world: we infer different things from “I expected an effect size of 4 and found an effect size of 4”, vs “I expected an effect size of 16 and found an effect size of 4”. It is genuinely informative to know the experimenter’s expectations. And their choice of sample size tells us something about their priors. If they use a stopping rule then it can be potentially misleading in that direction, and we learn something when we find out the experimenter had to recruit 4 times as many subjects as he or she originally intended.)
