Don’t say “improper prior.” Say “non-generative model.”

Posted on June 18, 2017 9:47 AM by Andrew

In Bayesian Data Analysis, we write, “In general, we call a prior density p(θ) proper if it does not depend on data and integrates to 1.” This was a step forward from the usual understanding, which is that a prior density is improper if it has an infinite integral.

But I’m not so thrilled with the term “proper” because it has different meanings for different people.

Then the other day I heard Dan Simpson and Mike Betancourt talking about “non-generative models,” and I thought, Yes! this is the perfect term! First, it’s unambiguous: a non-generative model is a model for which it is not possible to generate data. Second, it makes use of the existing term, “generative model,” hence no need to define a new concept of “proper prior.” Third, it’s a statement about the model as a whole, not just the prior.

I’ll explore the idea of a generative or non-generative model through some examples:

Classical iid model, y_i ~ normal(theta, 1), for i=1,…,n. This is not generative because there’s no rule for generating theta.

Bayesian model, y_i ~ normal(theta, 1), for i=1,…,n, with uniform prior density, p(theta) proportional to 1 on the real line. This is not generative because you can’t draw theta from a uniform on the real line.

Bayesian model, y_i ~ normal(theta, 1), for i=1,…,n, with data-based prior, theta ~ normal(y_bar, 10), where y_bar is the sample mean of y_1,…,y_n. This model is not generative because to generate theta, you need to know y, but you can’t generate y until you know theta.

In contrast, consider a Bayesian model, y_i ~ normal(theta, 1), for i=1,…,n, with non-data-based prior, theta ~ normal(0, 10). This is generative: you draw theta from the prior, then draw y given theta.
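The two-step recipe in this last example can be sketched in a few lines of Python. This is only an illustration: it assumes the second argument of normal(·, ·) is a standard deviation (as in Stan), and the function name `simulate_generative` and the choice n = 5 are hypothetical.

```python
import random

def simulate_generative(n, rng):
    """Draw (theta, y) from the joint model theta ~ normal(0, 10), y_i ~ normal(theta, 1)."""
    theta = rng.gauss(0.0, 10.0)                    # step 1: parameter from the prior
    y = [rng.gauss(theta, 1.0) for _ in range(n)]   # step 2: data given the parameter
    return theta, y

rng = random.Random(2017)   # seeded only to make the sketch reproducible
theta, y = simulate_generative(n=5, rng=rng)
print(theta, y)
```

The three earlier examples have no analogue of step 1: there is either no rule for theta, no distribution to draw it from, or a rule that needs y before y exists.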

Some subtleties do arise. For example, we’re implicitly conditioning on n. For the model to be fully generative, we’d need a prior distribution for n as well.

Similarly, for a regression model to be fully generative, you need a prior distribution on x.
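As a rough sketch of what "fully generative" would require for a regression, the simulation below also draws n and x. Every distribution in it (uniform n between 10 and 50, standard normal x, normal(0, 10) priors on the coefficients, unit residual scale) is an assumption chosen for illustration, and the function name is hypothetical.

```python
import random

def simulate_regression(rng):
    """One draw of (n, x, a, b, y) from a fully generative regression model.

    All distributions below are illustrative assumptions, not recommendations.
    """
    n = rng.randint(10, 50)                        # assumed prior on the sample size
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]    # assumed model for the predictors
    a = rng.gauss(0.0, 10.0)                       # prior on the intercept
    b = rng.gauss(0.0, 10.0)                       # prior on the slope
    y = [rng.gauss(a + b * xi, 1.0) for xi in x]   # outcomes given everything above
    return n, x, a, b, y

n, x, a, b, y = simulate_regression(random.Random(0))
print(n, len(x), len(y))
```

Removing the first two draws and passing n and x in as arguments gives the more familiar model that is generative only conditional on n and x.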

Non-generative models have their uses; we should just recognize when we’re using them. I think the traditional classification of priors, labeling them as improper if they have infinite integral, does not capture the key aspects of the problem.

P.S. Also relevant is this comment, regarding some discussion of models for the n:

As in many problems, I think we get some clarity by considering an existing problem as part of a larger hierarchical model or meta-analysis. So if we have a regression with outcomes y, predictors x, and sample size n, we can think of this as one of a larger class of problems, in which case it can make sense to think of n and x as varying across problems.

The issue is not so much whether n is a “random variable” in any particular study (although I will say that, in real studies, n typically is not precisely defined ahead of time, what with difficulties of recruitment, nonresponse, dropout, etc.) but rather that n can vary across the reference class of problems for which a model will be fit.

125 thoughts on “Don’t say “improper prior.” Say “non-generative model.””

Mayo on June 18, 2017 12:04 PM at 12:04 pm said:

I thought priors were to be based on all kinds of background info, and further that you can change your prior based on data. As it happens, I reblogged a post from 2 years ago today (“Can you change your Bayesian priors?”)
https://errorstatistics.com/2017/06/18/can-you-change-your-bayesian-prior-the-one-post-whose-comments-some-of-them-will-appear-in-my-new-book/
Even default priors require considering the experiment, model, order of importance of parameters. Is it more legitimate for priors to be generated from “in here” (beliefs) rather than “out there” (data)? Data-dependent specifications in general may be pejorative for a frequentist, as with many data dependent hypotheses and stopping rules, but not always. When they are problematic, the problem is reflected in illicit or changed error probabilities. How would the use of data-dependent priors show up as problematic for a Bayesian? I hear Berger talk derisively about “double use” of data, but to my knowledge he’s never given a Bayesian rationale for prohibiting, or even picking up on, the practice.

- Carlos Ungil on June 18, 2017 1:22 PM at 1:22 pm said:
  
  You can indeed change your prior based on data. There is a clearly established procedure to do that and the result is called the posterior distribution. But I think this is not what you meant.
  
  As a matter of principle, the prior is the probability distribution before some evidence (the data) is taken into account. The data cannot be part of the prior by definition!
  
  Of course the prior can be (and should be) informed by prior data (i.e. old evidence obtained before the new evidence we want to analyse). And in practice Bayesian methods can be abused, like most things in life. But what other Bayesian rationale to disallow the practice of double-counting data do you need, beyond the very definition of prior?
  
  - Anoneuoid on June 18, 2017 1:29 PM at 1:29 pm said:
    
    I think many times the prior is not independent of the data since people will fiddle with the model after looking at the results, run it again, fiddle some more, etc. In this way information about the data will leak into the model (including prior). Actually this is probably all the time unless the model was previously published. You really need to hold out test data until the moment of publication, or wait for more data to be collected after publishing (ie make a priori predictions).
    
  - Daniel Lakeland on June 18, 2017 3:33 PM at 3:33 pm said:
    
    There’s also a pretty obvious generalization of Bayesian model selection that makes “picking your model based on data” a fairly obvious approximation of a more general situation.
    
    Suppose you have 2 models you are willing to entertain, and you write up a mixture:
    
    P(Data | Params)P(Params) = P1(Data | Params1) p(Params1| P1) p(P1) + P2(Data|Params2) p(Params2 | P2) p(P2)
    
    and you see that your data essentially rules out, say, model 2… (an example is model 2 predicts 100% of data will be positive, and there are one or more data points that are negative). Then you can immediately infer p(P2) = 0 and p(P1) = 1 and write:
    
    p(Data | Params)p(Params) = P1(Data|Params1) p(Params1)
    
    Externally this “Looks like” you looked at your data and chose your model P1 based on the data. But in fact it produces the same exact inference as you would have gotten if you wrote out the full model and then did the inference.
    
    The same is true, approximately, when your data makes P(P2) very small and you simply truncate it out of the expression as an approximation.
    
    The more general case looks like “survey your data, from among all the classes of models you might be willing to consider, select the ones that seem capable of fitting the data, and use only those in your fit”. The fact is that it’s sometimes obvious after looking at your data that there’s no need to include certain possible models because they simply wouldn’t fit the data, and therefore would wind up with posterior probability sufficiently small that you can ignore them for simplicity. The logic is perfectly clear and perfectly Bayesian, it’s just also more or less approximate depending on the severity of the truncation you’re performing.
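A toy version of the truncation argument above, under strong simplifying assumptions: both models are fully specified (no free parameters, so each marginal likelihood is just a likelihood), model 1 is normal(0, 1), model 2 is a half-normal that puts zero density on negative values, and the prior model probabilities are equal. All names are hypothetical.

```python
import math

def normal_pdf(y):
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)

def half_normal_pdf(y):
    # Density of |z| with z ~ normal(0, 1): zero for negative values.
    return 2.0 * normal_pdf(y) if y >= 0 else 0.0

def posterior_model_probs(data, prior_m1=0.5):
    """Posterior probabilities of model 1 (normal) and model 2 (half-normal)."""
    like_m1 = math.prod(normal_pdf(y) for y in data)
    like_m2 = math.prod(half_normal_pdf(y) for y in data)
    w1 = prior_m1 * like_m1
    w2 = (1.0 - prior_m1) * like_m2
    return w1 / (w1 + w2), w2 / (w1 + w2)

print(posterior_model_probs([0.3, 1.2, 0.7]))    # all positive: model 2 is favoured
print(posterior_model_probs([0.3, -1.2, 0.7]))   # one negative point: model 2 gets probability 0
```

With a single negative observation, dropping model 2 by hand and fitting only model 1 gives exactly the same inference as carrying the full mixture.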
    
- Andrew on June 18, 2017 2:09 PM at 2:09 pm said:
  
  Mayo:
  
  As I’ve discussed many times, there’s nothing special about the so-called prior distribution. In practice we build up our models through experience, and we can change all aspects of our model—not just the “prior”—in light of information about how the model fits data.
  
  Regarding Berger and the so-called double use of the data, see here. The short answer is that Berger can be as derisive as he wants, but derision is no substitute for clean notation.
  
Daniel Weissman on June 18, 2017 1:05 PM at 1:05 pm said:

Typo? Should it be “In contrast, consider a Bayesian model… with *data-independent* prior”?

Maybe I’m misunderstanding, but in the classical model, presumably there would usually be some rule for generating theta (e.g., taking it to be 0, or the sample mean). If you’re doing old-fashioned NHST, the null hypothesis might be a generative model.

- Daniel Lakeland on June 18, 2017 3:34 PM at 3:34 pm said:
  
  No, his point is that a “data dependent prior” doesn’t have a prior predictive distribution, because you need to see the data before you can calculate your prior.
  
  - Simon on June 18, 2017 6:35 PM at 6:35 pm said:
    
    I’m with Daniel here. “data-based prior, theta ~ normal(0, 10)” that prior seems data independent.
    
    - Bob Carpenter on June 18, 2017 6:48 PM at 6:48 pm said:
      
      I’m pretty sure Andrew means that you set the scale of 10 after seeing the data. It remains generative (in the sense you can generate the parameters from the prior and the data from the parameters) in that it’s not expressed as a function of the data.
      
      See Andrew’s comment above: http://statmodeling.stat.columbia.edu/2017/06/18/dont-say-improper-prior-say-non-generative-model/#comment-509689
    - Andrew on June 18, 2017 9:22 PM at 9:22 pm said:
      
      Bob:
      
      If you set the scale of the prior after seeing the data, that’s a non-generative model. One could consider this as an approximation to some generative model, but as it stands it’s non-generative, and I think that’s relevant to understanding the model and how it works.
    - Daniel Weissman on June 19, 2017 11:27 AM at 11:27 am said:
      
      Right, so there is a typo in the sentence beginning “In contrast…”? That prior is meant to be *data-independent*, not “data-based”, right? (Hence the contrast with the previous examples.)
    - Andrew on June 19, 2017 11:43 AM at 11:43 am said:
      
      Daniel:
      
      Yup, that was a slip; I fixed it.
    - Bob Carpenter on June 19, 2017 3:43 PM at 3:43 pm said:
      
      Could you define what you mean by generative model? It sounds like the history of how you arrived at the model is somehow tied up in the definition rather than being an intrinsic property of the model.
      
      Just to be clear, I’m talking about the model where you look at the data or use some other knowledge to set a normal(0, 10) prior, not a prior that has a term involving the data in it. That seems consistent with being generative in the definition you cite below. Maybe I just misunderstood which example we’re talking about—the one I thought we were talking about isn’t in the list any more!
    - Andrew on June 19, 2017 3:49 PM at 3:49 pm said:
      
      Bob:
      
      A generative model is a joint probability distribution over all data and parameters.
    - Bob Carpenter on June 19, 2017 4:12 PM at 4:12 pm said:
      
      By that definition, I still get a generative model if I look at some data $latex y$, then set up the joint density:
      
      $latex p(y, \mu) = \mbox{Normal}(y \mid \mu, 1) \, \mbox{Normal}(\mu \mid 0, 10)$
      
      So I think we were talking about different models.
    - Andrew on June 19, 2017 4:37 PM at 4:37 pm said:
      
      Bob:
      
      Your example is not fully specified, as you’re saying the model is a function of the data you saw. So rather than writing “10,” you should write something like “g(y)”, where g(y)=10 for the specific data you saw. Another way to see this is that, with a generative model, you should be able to generate mu from its marginal distribution, and then y|mu. Alternatively, you can generate y from its marginal distribution, and then mu|y. In either case, in the model you wrote there is no clearly defined marginal distribution. You’ll have a similar problem with a model such as mu ~ normal(A, 1) and y_1,…,y_n|mu ~ normal(mu, 1), where you set A = y_bar. This can work in practice, and it can be seen as an approximation or limit of certain generative models, but it’s not a generative model because, to generate y you need theta, but to generate theta you need y, and there’s no joint distribution.
      
      To return to your example: Sure, there must be some nontrivial functions g(y) for which your model does correspond to a generative model, albeit not one where y|mu ~ normal(mu, 1). So, at best, you might be lucking into a generative model which is not what it looks like has been specified.
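The missing joint distribution in the A = y_bar example can be checked directly. The sketch below takes n = 1, so that A = y; that is a simplification for illustration, and it writes mu for the parameter throughout.

```latex
p(y \mid \mu)\, p(\mu)
  = \mathrm{Normal}(y \mid \mu, 1)\,\mathrm{Normal}(\mu \mid y, 1)
  = \frac{1}{2\pi} \exp\!\left(-(y - \mu)^2\right),
\qquad
\iint \frac{1}{2\pi} \exp\!\left(-(y - \mu)^2\right) dy \, d\mu = \infty .
```

The product depends on y and mu only through y - mu, so it is constant along every line y = mu + c and cannot be normalized. The same shift-invariance holds for general n with A = y_bar, since adding a constant to mu and to every y_i leaves each factor unchanged.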
    - Daniel Lakeland on June 19, 2017 5:38 PM at 5:38 pm said:
      
      Andrew, the Ising model
      
      p(State) = exp(-H(State))/Z
      
      is a joint distribution. If I then tell you the state of spin_i for i=1…100 and there are 1000 spins, you can call this p(Data,Parameters) where the spin state for the other 900 spins are parameters.
      
      How does this differ from: p(Data,Parameters) = product(normal(Data_i,mu,sigma))*normal(mu,0,100)*gamma(sigma,2,2/sd(Data))
      
      where my prior mu ~ normal(0,100) is clearly data independent, but my prior sigma ~ gamma(2,2/sd(Data)) is not.
      
      Certainly there is a joint distribution over Data and Parameters, and you could sample from it using Stan and use the generated quantities to *generate* pseudo-data.
      
      To me, the difference is that the model *can not be causal* because information propagates backward in time from Data to sigma.
    - Daniel Lakeland on June 19, 2017 5:48 PM at 5:48 pm said:
      
      To clarify what I mean there. I can imagine a causal process in which say some states of the world cause a data generating machine to have a certain sigma, and a certain mu, and then I run the machine forward and I get data based on those fixed parameter values. The physical process that causes the data generating machine to have a given sigma can not itself causally depend on the sigma I will observe after I collect the data.
      
      I can think of a data spitting function that operates on discrete time: Data(t,mu,sigma)
      
      I can’t run this first, and then observe the sd(Data) that comes out, and then go back in time and generate a sigma to plug in.
      
      On the other hand, Bayesian analysis is (a kind of) generalized logic, and I can logically infer what must have been true at a point in time in the past. I just can’t give it a causal interpretation (the data came out with sd(Data) = 1 and this *caused* the machine to plug in 1.15 for its sigma parameter 10 minutes earlier)
    - Andrew on June 19, 2017 6:06 PM at 6:06 pm said:
      
      Daniel:
      
      Interesting point, the extent to which a generative model on parameters & data implies some sort of factorization. I’ll have to chew on this one.
    - Daniel Weissman on June 19, 2017 11:24 AM at 11:24 am said:
      
      Simon: Easy to say “I’m with Daniel” in this case!
    - Daniel Lakeland on June 19, 2017 2:13 PM at 2:13 pm said:
      
      +1!
Daniel Lakeland on June 18, 2017 1:18 PM at 1:18 pm said:

Andrew: I have a model right now where I am predicting something and doing something along the lines of:

Observed / Predictor(ParameterVector,Covariate) ~ gamma(5,4);

I consider this to be a likelihood type expression but it’s not directly a generative likelihood. To generate from this you’d do:

a = generate(ParameterVector)
b = generate(gamma(5,4))
c = generate(PseudoCovariate)

Pseudo_Observed = b*Predictor(a,c)

from the perspective of probability expressions we’ve got

p(Observed/Predictor(Parameter,Covariate) | Parameter,Covariate)

which is basically like a regression formula, think y - f(x,Par) = epsilon instead of y = f(x,Par) + epsilon

Do you consider this kind of model to be generative? I’ve never really understood the *precise* definition of a generative model. When something is obviously generative… it’s easy, but in a model like this where generating the pseudo-data takes several transformations I don’t know whether it “counts”.
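The generation steps listed in this comment can be written out as a short Python sketch. Several pieces are assumptions made only so the code runs: the exponential-linear form of `predictor`, the standard normal draws for the parameter vector and the pseudo-covariate, and the reading of gamma(5,4) as shape 5 and rate 4 (hence scale 1/4 in `random.gammavariate`).

```python
import math
import random

def predictor(params, covariate):
    # Hypothetical positive-valued stand-in for Predictor(ParameterVector, Covariate).
    intercept, slope = params
    return math.exp(intercept + slope * covariate)

def simulate_observed(rng):
    a = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))   # generate(ParameterVector), assumed priors
    b = rng.gammavariate(5.0, 1.0 / 4.0)             # gamma(5,4) read as shape 5, rate 4
    c = rng.gauss(0.0, 1.0)                          # generate(PseudoCovariate), assumed
    return b * predictor(a, c), a, b, c              # Pseudo_Observed = b * Predictor(a, c)

pseudo_observed, a, b, c = simulate_observed(random.Random(42))
print(pseudo_observed)
```

Each call is a fixed, finite sequence of random draws followed by a deterministic transformation, so pseudo-data can be produced without reference to any observed data.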

Dustin Tran on June 18, 2017 8:03 PM at 8:03 pm said:

Hi, I agree with the sentiment. But I think there are a few subtleties that get clouded here:

+ Models with improper priors are non-generative a priori but are typically generative a posteriori. “Non-generative” conflicts with those who still think generatively but are willing to only have it be generative given data.

+ “Non-generative models” are associated with undirected models, or energy-based models. I think most people who use improper priors don’t necessarily associate with them. They need a subclass of “non-generative models” to be more descriptive: “improper priors” seems better.

- Andrew on June 18, 2017 9:16 PM at 9:16 pm said:
  
  Dustin:
  
  1. “Generative” specifically means being able to generate the data. So I don’t think “generative a posteriori” means anything.
  
  2. A model such as the Ising model (which I think is an example of what you are calling an energy-based model) is generative in the sense of my post above. My definition of generative is mathematical, not algorithmic; that is, I refer to a model as generative if one could generate parameters and data, even if that process would take an unrealistically long time.
  
  - Dustin Tran on June 18, 2017 10:13 PM at 10:13 pm said:
    
    I think your definition of generative is non-standard. At the least, it goes back to ambiguity, which is one reason you stated you dislike use of “improper prior”!
    
    - Andrew on June 19, 2017 7:02 AM at 7:02 am said:
      
      Dustin:
      
      From wikipedia: “In probability and statistics, a generative model is a model for randomly generating observable data values, typically given some hidden parameters. It specifies a joint probability distribution over observation and label sequences.” That’s what I’m saying: a joint probability distribution, which includes joint distributions like the Ising model that are difficult to draw simulations from.
      
      There are many different concepts floating around here. There’s the concept of the non-generative model (as defined by Wikipedia and me), there’s the concept of the model that cannot be computably generated (which is what you seem to be talking about and which can be defined, I think, only in an asymptotic sense), there’s the concept of the joint probability density function with infinite integral (sometimes called an improper prior), there’s the concept of the data-dependent prior distribution (another form of improper prior), there are related concepts such as approximate models or approximate algorithms. To me, the salient feature of the prior density with infinite integral and the data-dependent prior is that they are non-generative. There’s nothing wrong with people talking about all these things; the problem that motivated the above post is that in the literature (including mine) there’s a tendency to talk about so-called improper priors without recognizing that very similar issues arise with data-dependent priors. I don’t want to ban data-dependent priors or non-generative models; I just think we should recognize them in our statistical practice.
    - Tom Dietterich on June 19, 2017 6:49 PM at 6:49 pm said:
      
      Andrew’s use of “generative model” is the same as the way we use this term in the machine learning community. We think of it as a simulation model for simulating data sets (which is why, as Andrew says, we would need a prior on N as well as the other model parameters).
Leon on June 18, 2017 10:32 PM at 10:32 pm said:

There does exist a generative interpretation of a bunch of improper prior models, if you interpret the improper prior as a Poisson point process on the parameter space:

“On Bayes’s theorem for improper mixtures” by Peter McCullagh and Han Han. Ann. Statist. Volume 39, Number 4 (2011), 2007-2020.

Daniel Lakeland on June 19, 2017 12:28 AM at 12:28 am said:

It’s interesting to think about the improper prior in terms of nonstandard analysis (NSA). For example, using the IST form of NSA, we could create a nonstandard prior

a ~ normal(0,N) with N a nonstandard integer. (think of a nonstandard integer as an integer that is bigger than every integer that will ever have an explicit formula actually written down for it in the history or future of the universe).

Now, this is nonstandard-generative in the sense that we could, using a nonstandard RNG, sample a, but the probability that a will be limited (absolute value less than some standard number) is infinitesimal. In essence this nonstandard prior says that you’re QUITE SURE that a will be enormous.

That’s pretty much obviously wrong in most applied cases. Using an improper prior is actually a strong prior telling your model that the value of the parameter is almost surely infinitely large, whereas in practical problems we can almost always, without too much difficulty, give a power of ten that we’re virtually certain the parameter will be less than. For example, if you’re talking about a length, we can take the diameter of the known universe and multiply it by 10^100, and we’re extraordinarily certain our length will be less than this number. But a nonstandard prior says we’re extraordinarily certain the parameter will be BIGGER than that number!

This is a good reason to forgo the use of improper priors. In every real applied problem, we’re always sure that the parameter is a standard real number.
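A standard-analysis analogue of this argument, sketched with finite but growing scales instead of a nonstandard N: for a ~ normal(0, scale), the probability that |a| stays below any fixed bound is erf(bound / (scale * sqrt(2))), which shrinks toward zero as the scale grows. The bound of 1e3 and the list of scales are arbitrary choices for illustration.

```python
import math

def prob_within(bound, scale):
    """P(|a| < bound) when a ~ normal(0, scale), with scale a standard deviation."""
    return math.erf(bound / (scale * math.sqrt(2.0)))

bound = 1e3   # hypothetical limit we are sure the parameter respects
for scale in (1e3, 1e6, 1e9, 1e12):
    print(scale, prob_within(bound, scale))
```

The wider the prior, the more prior mass it places beyond the bound, which is the sense in which a very diffuse prior asserts that the parameter is probably enormous.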

Stephen Martin on June 19, 2017 1:42 AM at 1:42 am said:

I’m not sure one would need a ‘prior’ on N, if only because I’m not sure N is a random variable, no?

I mean, you /could/ place a prior on N; nothing is stopping you [in theory, anyway]. But I don’t think the failure to do so would render a density a non-generative model.

Though a counterpoint is that y ~ normal(0,1) is a generative model, with the implicit prior that p(mu = 0) = 1; p(sigma = 1) = 1. One could, I suppose say, y_n ~ normal(0,1,n=100) with an implicit prior that p(n = 100) = 1. Still, seems odd to me to consider N a random variable; without a prior on N, I would still consider it a generative model, just as one could do with any fixed parameter.

- Andrew on June 19, 2017 6:51 AM at 6:51 am said:
  
  Stephen:
  
  You could say that a model that does not generate N could still be a generative model conditional on N, but then it’s not generative for new settings where N can take on other values. This is similar to how you can think of a regression model as generative conditional on X, but not generative in new settings with new values of X.
  
  - Mike on June 19, 2017 11:09 AM at 11:09 am said:
    
    Except that if you ever catch me in that position, I’ll just say that I used a uniform(0, S) prior where S is the number of atoms in the universe. As long as I don’t have to actually generate new data, I can say that the model is generative.
    
  - Bob Carpenter on June 19, 2017 4:01 PM at 4:01 pm said:
    
    I’ll repeat Andrew’s quote from Wikipedia:
    
    “In probability and statistics, a generative model is a model for randomly generating observable data values, typically given some hidden parameters. It specifies a joint probability distribution over observation and label sequences.” That’s what I’m saying: a joint probability distribution, which includes joint distributions like the Ising model that are difficult to draw simulations from.
    
    Now let’s suppose I write this down
    
    $latex f(y) = \prod_{n=1}^{\mathrm{size}(y)} \mbox{Normal}(y_n \mid 0, 1)$
    
    Of course, $latex f$ doesn’t normalize unless you fix the size of $latex y$. Now, let’s say I fix $latex \mbox{size}(y) = 100$. Now everything normalizes and it satisfies Wikipedia’s definition, but it seems defective as a generative process. We can instead treat the whole thing as an i.i.d. stochastic process, with
    
    $latex y_n \sim \mbox{Normal}(0, 1)$
    
    Is this somehow no longer generative because I’m not generating $latex \mbox{size}(y)$?
    
    Are your objections related to finite population stats? I never got a proper stats education, and most of the Bayesian stats I see is firmly rooted in super-population assumptions rather than finite population adjustments.
    
    - Daniel Lakeland on June 19, 2017 6:20 PM at 6:20 pm said:
      
      Bob, I don’t have a formal stats education either. I think it helps ;-)
      
      My intuition about what a “generative” model means actually is that you could imagine a finite sequence of calls to a random number generator that successively generate the parameters and then once all the parameters are generated generate the data.
      
      So, the algorithmic complexity is such that halting is guaranteed (given sufficiently sophisticated RNGs that can generate independent draws from whatever distro you’d like to specify in constant time, say).
      
      On the other hand, consider a model
      
      Data ~ normal(mu,sigma)
      mu ~ normal(0,100);
      sigma ~ gamma(3.0,3.0/sd(Data))
      
      it’s not possible to ensure that you can generate mu,sigma,Data in any fixed number of steps. You could do something like
      mu = rnorm(1,0,100)
      sigma = NonDeterministicallyChooseSomething
      Data = rnorm(n, mu, sigma)
      S = sd(Data)
      accept with probability gamma_pdf(sigma,3.0,3.0/S)
      
      Computability wise this seems to be in a different algorithmic class, it requires iteration and some special choice point / algorithmic nondeterminism.
      
      A generative model does produce a joint distribution, but it seems clear that some joint distributions aren’t subject to this finite step generative factorization.
      
      There’s something really interesting hidden in this intuition here. I hope Andrew will chew on it some more. I might blog a bit on it as well.
    - Stephen Martin on June 19, 2017 6:54 PM at 6:54 pm said:
      
      I’m not sure whether you’re replying to me.
      
      I’m not really objecting to anything. I just was surprised at “For the model to be fully generative, we’d need a prior distribution for n as well.”
      I wouldn’t think n would be a requirement for a generative model, only because I don’t see it as a random variable.
      In a regression, it makes sense that: X is a random variable, Y is a random variable, betas are random variables, the residual variance is a random variable, etc. But it’s hard for me to think of N as a random variable. Because of that, I would consider a model that doesn’t treat N as a random variable to be generative.
      
      But I can see both sides too, since p(y | theta, N=100)p(N = 100), where p(N=100) = 1 is implicitly assumed when one fixes N to 100. Additionally, N observations *could* be dictated by other variables, like “we are collecting as much as we can in two days”; then N may vary by payment, number of accessible participants, data collector schedules, etc. But I’m in the world of “collect data until N=400”, in which case N isn’t really a random variable, but time is; this is probably what is biasing my read of that statement.
    - Christian Hennig on June 23, 2017 12:15 PM at 12:15 pm said:
      
      I’d think that one-point distributions are perfectly valid when it comes to defining generative models, so if you fix n and \theta and \sigma^2, “n observations i.i.d. from N(\theta,\sigma^2)” is a perfectly generative model.
  - Corey on June 19, 2017 10:04 PM at 10:04 pm said:
    
    Of course, priors that depend on n usually make no sense from the perspective that holds that priors should encode (some of) the available information about plausible parameter values. The logical connection between data set size and parameter values is typically… obscure, to say the least. They mostly arise in derivations of reference priors, don’t they?
    
    - Andrew on June 19, 2017 10:43 PM at 10:43 pm said:
      
      Corey:
      
      N can provide information in that it can be a proxy for information that was used in the design of a study that’s not included in the available data.
      
      As in many problems, I think we get some clarity by considering an existing problem as part of a larger hierarchical model or meta-analysis. So if we have a regression with outcomes y, predictors x, and sample size N, we can think of this as one of a larger class of problems, in which case it can make sense to think of N and x as varying across problems.
      
      The issue is not so much whether N is a “random variable” in any particular study (although I will say that, in real studies, N typically is not precisely defined ahead of time, what with difficulties of recruitment, nonresponse, dropout, etc.) but rather that N can vary across the reference class of problems for which a model will be fit.
    - Corey on June 19, 2017 11:25 PM at 11:25 pm said:
      
      Well sure, N can provide information by proxy, and N obviously varies across problems (and observed N is not necessarily equal to design N). But have you, AG, ever used a prior that was formally a function of N? (I personally can’t recall ever using N even informally to help set up weakly informative priors.)
    - Daniel Lakeland on June 19, 2017 11:42 PM at 11:42 pm said:
      
      I don’t know about priors, but N can inform the likelihood relatively easily when there’s a stopping rule or whatnot. And I’ve definitely used N in models where there is censoring. For example in a situation where you’re counting fruitflies, I’ve used N* as a parameter, the unknown number of eggs laid, and then various probabilities of certain genotypes based on recombination probabilities and fatal genotypes etc, and then an observed N that results is such that N ~ distrib(f(N*),Params)
      
      So in that case N* is a parameter and there’s a prior over it… does that count?
      ;-)
    - Carlos Ungil on June 19, 2017 11:54 PM at 11:54 pm said:
      
      Could you give an example of how N can inform the likelihood when there is a stopping rule?
    - Corey on June 20, 2017 12:04 AM at 12:04 am said:
      
      Naw, that’s a genuine unknown quantity. I’m talking about priors where in counterfactual world where N was larger by one you’d use a prior with a (slightly) different shape on parameter space and the functional dependence of that shape on N is explicit.
    - Daniel Lakeland on June 20, 2017 10:53 AM at 10:53 am said:
      
      Carlos: survey people and ask them if they enjoy product X, product Y, product Z, and stop after you’ve seen 3 people in a row who enjoy product X.
      
      If that takes a really long time, it tells you something about how small the frequency is vs if you stop right away.
      
      kind of a contrived example, but here Nx, Ny, Nz and Ntotal are unknown at the start of the experiment.
      
      Corey: I didn’t see that as the original point. I think the original point is that if you do an experiment under alternative conditions, you might get a different N and so if you want to generate fake datasets you have to have a prior *for the quantity N*
    - Corey on June 20, 2017 10:56 AM at 10:56 am said:
      
      Daniel: Ah right. That makes more sense
    - Carlos Ungil on June 20, 2017 3:15 PM at 3:15 pm said:
      
      Daniel: I don’t understand your example.
      
      Just to be clear, I thought that you meant that the likelihood function (which is combined with the prior and the data in a Bayesian analysis) will be different depending on the stopping rule used. Please disregard what follows if I misinterpreted your comment.
      
      > survey people and ask them if they enjoy product X, product Y, product Z, and stop after you’ve seen 3 people in a row who enjoy product X.
      
      Let’s say you get the following responses, using your stopping rule: [+ - +] [+ - -] [- - -] [- + +] [- + -] [+ + -] [+ + +] [- + -] [+ + +] [+ + +] [+ + +]
      
      Let’s say I perform an identical study with a different stopping rule (for example, perform eleven surveys) and get exactly the same responses.
      
      Are you saying that the likelihood in your analysis will be different to the likelihood in my analysis?
    - Corey on June 20, 2017 3:41 PM at 3:41 pm said:
      
      Carlos, I don’t think Daniel’s saying you’d get a different likelihood function — I think he’s saying that N is the data (as opposed to an experimental design parameter) and informs the likelihood in that way.
    - Carlos Ungil on June 20, 2017 4:05 PM at 4:05 pm said:
      
      Corey, what would “N can inform the likelihood relatively easily when there’s a stopping rule” mean precisely in that case? When there’s no stopping rule (or rather a non-stochastic one: a fixed value of N) the likelihood is also being informed (even more so, I would say).
    - Daniel Lakeland on June 20, 2017 4:15 PM at 4:15 pm said:
      
      Carlos, consider the following fact. In the design that says “do 11 surveys” the last 3 survey responses could be anything, but in the “stop when you get 3 in a row” the last 3 survey responses are *always* 3 positive reviews of product X. Conditional on the design, you should consider the last three responses as providing different information. In the stopping rule version there is nothing uncertain about the last three x responses.
      
      So, yes I think you do need a different likelihood function, one which assigns probability 1 to a sequence of 3 positive results for x at the end.
    - Carlos Ungil on June 20, 2017 4:19 PM at 4:19 pm said:
      
      Never mind, I think I got what you meant. In the example I gave the general function (of parameters and data) is different when N is fixed and when there is a more complex stopping rule (the function has to take care in that case of a data vector of variable length). The latter will “collapse” once N is known to become identical to the former, and conditional on the data the likelihood function (of parameters only) will be the same in both cases.
    - Daniel Lakeland on June 20, 2017 4:22 PM at 4:22 pm said:
      
      Carlos, yes you’re on the right track, but as I say in the stopping rule example, the last 3 data points *always* have +X so in fact they don’t become the same when you finally plug in N, because you need to assign P = 1 to a sequence of 3 +X values at the end, whereas that’s not the case in the fixed length version.
    - Carlos Ungil on June 20, 2017 4:27 PM at 4:27 pm said:
      
      Daniel, when you have the data there is nothing uncertain about the last three x responses in the “11 surveys” setting either. You will get the same likelihood (a function of the parameters) in both cases, but it’s true that you will arrive at that likelihood by different paths (see my previous comment).
    - Carlos Ungil on June 20, 2017 4:44 PM at 4:44 pm said:
      
      Example: we want to estimate the parameter for a Bernoulli trial (let’s call p the probability of getting Head when a coin is flipped).
      
      I toss the coin once: I get Head.
      
      You toss the coin until you get Head: you get Head.
      
      I say the likelihood function is (up to a constant) L(p) = p in both cases.
      
      What do you say?
    - Daniel Lakeland on June 20, 2017 5:03 PM at 5:03 pm said:
      
      Carlos, you might be right, I’d have to work out the details to see if you get the same results. It seems to me like the likelihood associated with the fixed length would have independent factors related to all 11 answers… whereas the likelihood for the stopping rule case would have independent factors associated with 8 answers, and associated with the N at which the 3 final answers occurred. Perhaps that winds up being the same, but it certainly isn’t the same pre-data, and the data generating process you look at to create fake data sets is definitely very different.
    - Daniel Lakeland on June 20, 2017 6:11 PM at 6:11 pm said:
      
      Run a set of surveys knocking on successive doors until your feet hurt or you get discouraged vs knock on fixed 100 doors.
      
      The model now needs extra parameters involving pain in the feet and discouragement. Suppose the disposition of the surveyor affects the answers to the surveys, suppose the stopping rule also affects the way in which the survey is carried out (perhaps someone who knows they need 100 doors goes slower, takes more breaks or something)
      
      Now, whether the person has pain in the feet (maybe causes sympathy among subjects so they answer medical questions differently) or discouragement (maybe causes different answers to economics questions about consumer confidence) could affect the outcomes.
      
      I realize this is all contrived, but I’m just trying to come up with some example where the data collection process affects the outcomes, and even if you happen to get the same outcomes, you might infer different parameter values because of the feedback of the data collection method onto the outcomes.
    - Carlos Ungil on June 20, 2017 9:55 PM at 9:55 pm said:
      
      > I’m just trying to come up with some example where the data collection process affects the outcomes, and even if you happen to get the same outcomes, you might infer different parameter values because of the feedback of the data collection method onto the outcomes.
      
      How about this? We want to estimate the parameter for a Bernoulli trial (let’s call p the probability of success of some experiment).
      I use N=1 fixed. Data = “success”. Likelihood(p)=p
      Your stopping rule is that you don’t feel like doing any work, you’d rather use a pseudo-random number generator to generate an answer and call it a day. Data = “success”. Likelihood(p)=1
      
      (You may also want to read about “informative stopping rules”.)
    - Corey on June 20, 2017 10:15 PM at 10:15 pm said:
      
      Daniel, Carlos’s “you toss the coin until you get head” design induces a negative binomial sampling distribution (heads here maps to the “failure” outcome in the article).
      
      I’m a bit surprised you’re not already familiar with this. It is in fact a fairly well-known example of a case where frequentist inference and likelihood-principle-compliant inference disagree: the binomial and negative binomial sampling distributions induce the same likelihood function but have different tail areas. David MacKay’s book (warning: big PDF) has a typical Bayesian exposition starting at the bottom of page 462.
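      
      A small numerical check of the “same likelihood function” claim, as a sketch only: the data (3 heads in 12 tosses, ending on a head) and the function names are hypothetical. It compares the fixed-n binomial probability with the toss-until-k-heads negative binomial probability over a grid of p.
      
      ```python
      from math import comb
      
      def binomial_lik(p, n, k):
          """Probability of k heads when the number of tosses n is fixed in advance."""
          return comb(n, k) * p**k * (1 - p)**(n - k)
      
      def neg_binomial_lik(p, n, k):
          """Probability that the k-th head arrives exactly on toss n (toss until k heads)."""
          return comb(n - 1, k - 1) * p**k * (1 - p)**(n - k)
      
      # Hypothetical data: 3 heads in 12 tosses, with the last toss a head.
      n, k = 12, 3
      grid = [i / 10 for i in range(1, 10)]
      ratios = [binomial_lik(p, n, k) / neg_binomial_lik(p, n, k) for p in grid]
      print(ratios)  # the ratio does not vary with p, so the two likelihoods are proportional
      ```
      
      In the one-toss case from the comment above (n = 1, k = 1) both functions reduce to p.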
    - Daniel Lakeland on June 20, 2017 11:34 PM at 11:34 pm said:
      
      Corey: of course I’ve heard of the negative binomial distribution, and I’ve done the typical homeworks etc, even taught it to undergrads, but I’ve never really thought or read that much about the comparison of likelihood and frequentist inference under stopping rules. I see from Carlos’ suggestion that there is a distinction between “noninformative” and “informative” stopping rules. So I think that answers the fundamental question. In the presence of an informative stopping rule, the likelihood needs to be altered to take account of the additional information. Apparently this is fairly unusual in the world of contrived textbook examples. But in the world of complex haphazard nuisance parameter filled real-world data collection procedures, I could see it being a real thing we have to think about more often.
    - Daniel Lakeland on June 21, 2017 9:21 AM at 9:21 am said:
      
      Here’s a possible example: you run a clinical trial. There are two possible protocols, one in which if people drop out you record them as dropped out, and one in which they agree to stiff financial penalties for dropping out and if they don’t show up you follow up heavily, and your protocol basically specifies that you do whatever it takes to eventually record their full set of measurements.
      
      Now, in the first scenario where dropout is possible, we assume there are hidden parameters that describe why they might drop out. In the second scenario, although there are hidden parameters for why they might want to drop out, drop out is virtually guaranteed not to be observed (further let’s posit that the efforts gone to to get the patient to comply were not recorded so you can’t regress on “effort”).
      
      Now in the first scenario Nin and Nout in each group are themselves data which inform us about the parameters related to “desire to drop out” (perhaps related to side effects or lack of main effect or whatever) whereas in the second case Nin = 100% by design and so we actually lack this information, we’ll have to infer side effects or main effect issues directly from whatever questions we asked, and it’ll be uncensored due to our heroic efforts.
      
      I think this kind of thing is far far more common in messy real world scenarios than it is in textbook coin flipping games and so forth. I see again via googling that the theoreticians separate things into “informative” and “noninformative” censoring scenarios. I suspect informative censoring is more like the rule than the exception when censoring is caused by some aspect of treatment, whereas when censoring is more or less consistent (the machine tops out at 100 grams) or random (12% of data is missing due to the machine’s limitations) you’d wind up with uninformative censoring.
    - Daniel Lakeland on June 21, 2017 11:41 AM at 11:41 am said:
      
      I love these old microfiche papers: This one by Barlow and Shor out of UC Berkeley explains: when a stopping rule is random and non-independent of the parameters, the stopping rule is informative and must be taken into account.
      
      So, assuming they’ve analyzed this correctly, my intuition about how stopping rules that are related to parameters such as unobserved pain in the feet of the surveyor, or unobserved side-effects of the medicines in the clinical trial, etc are exactly the kind of thing where variable stopping alters the likelihood of observed data.
    - Daniel Lakeland on June 21, 2017 11:42 AM at 11:42 am said:
      
      Of course it helps to include the link :-)
      
      http://www.dtic.mil/dtic/tr/fulltext/u2/a140599.pdf
    - Daniel Lakeland on June 21, 2017 11:53 AM at 11:53 am said:
      
      Also, Lindley who makes the distinction Carlos made: between the likelihood function pre-data, say where N is an algebraic variable and the likelihood function after plugging the data in where N is a number.
      
      http://www.commanster.eu/articles/stoprule.pdf
      
      And he reiterates that to be uninformative the stopping rule needs to be independent of the unknown parameter. For example if you do a Bayesian analysis after each data point comes in, and then you stop when the posterior probability of some parameter value drops low enough for some rule (or you hit a maximum number of observations), and then you want to use this data to do inference on another parameter later in some expanded model… you’ll need to include the fact that the stopping rule was dependent on the inference from the simpler model about the parameter.
      
      That seems very relevant to adaptive clinical trials for example.
    - Corey on June 22, 2017 9:37 AM at 9:37 am said:
      
      Daniel: just to clarify, my surprise is that you haven’t encountered this particular example illustrating the likelihood principle — Bayesians love this one because you can make the stopping rule contingent on all sorts of things the experimenter might not actually be aware of. (For example, a samurai is hiding nearby and plans to swoop in and slice the coin in half with his sword on the second toss, but you stop on the first and he never gets a chance. Should this affect your inference?)
    - Daniel Lakeland on June 22, 2017 11:36 PM at 11:36 pm said:
      
      Corey: ironic how here I am a pretty hard core Bayesian and I’m arguing that in most real world scenarios when some aspect of the experiment is under a person’s control, you’d be best off modeling that into the data generating process, instead of ignoring it and pretending the reason why people stop collecting data is irrelevant.
    - ojm on June 24, 2017 9:05 PM at 9:05 pm said:
      
      Being familiar with the literature isn’t a pre-req for having strong opinions on Bayes vs Freq.
    - Daniel Lakeland on June 24, 2017 10:08 PM at 10:08 pm said:
      
      ojm: when it comes to stopping rules. You’re right. I had little background with the literature. But, at the same time, when it comes to interpreting the literature, the thing I found most confusing is the notion of “random” vs “deterministic” and I think this is the big thing I walked away with finally. The wishywashy pseudo-frequentist Bayesian treatment obfuscates things for me at least.
      
      Random vs deterministic when it comes to Cox/Jaynes probability *SIMPLY IS NOT A PROPERTY OF THE STOPPING RULE* it’s a property of *your background knowledge*, including the knowledge of what the rule was.
      
      I reinterpreted the final result of all of this into my own blog post:
      
      http://models.street-artists.org/2017/06/24/on-models-of-the-stopping-process-informativeness-and-uninformativeness/
      
      where, once cast into the Cox conception (probability is assigned conditional on background knowledge) the result is basically obvious. It comes down to this:
      
      if you can determine from the data with 100% reliability whether the experiment stops, then the data and background contains all the information you’d need. This is because as soon as you see the data, you can infer the stopping logically. Fine.
      
      Next if you can’t determine from the data with 100% reliability whether the experiment stops, then the fact of stopping is additional information, but if your assigned probability doesn’t change with any of the parameters… then it doesn’t inform you about anything.
      
      All of this is true, regardless of whether the person actually running the experiment was able to deterministically determine stopping. That’s the part that isn’t obvious from the literature. “stop when you see HHH” is clearly deterministic to the experimenter, whereas “stop when you see the special secret sequence” is deterministic to the experimenter who knows the secret, and random to you.
      
      In all the cases I actually care about, the real reason for stopping is unknown to me, and usually not explicitly known to the person who made the decision either. (ie. it’s a gut instinct thing or it’s a negotiation between several parties, etc)
      
      So, hey. I dare you to find a better treatment of this real world issue regarding nebulous stopping rules than what’s here.
    - Keith O'Rourke on June 20, 2017 8:09 AM at 8:09 am said:
      
      > clarity by considering an existing problem as part of a larger hierarchical model or meta-analysis
      That was Fisher’s take in his earliest papers and his 1956 book though AWF Edwards told me he does not think anyone noticed that.
      
      e.g. “In practical terms, if from samples of 10 two or more different estimates can be calculated, we may compare their values by considering the precision of a large sample of such estimates each derived from a sample of only 10, and calculate for preference that estimate which would at this second stage [meta-analysis stage] give the highest precision.”
Larry Raffalovich on June 19, 2017 4:23 AM at 4:23 am said:

How is selecting a prior after viewing the data different from selecting regressors after viewing
the data or selecting a model by max (or min) some function of the data?

- Daniel Lakeland on June 19, 2017 11:14 AM at 11:14 am said:
  
  All of them are fundamentally the same and related to my idea linked above http://statmodeling.stat.columbia.edu/2017/06/18/dont-say-improper-prior-say-non-generative-model/#comment-509720
  
  - Daniel Lakeland on June 19, 2017 11:24 AM at 11:24 am said:
    
    It’s only a truly legitimate practice if there were no other models that survived your first peek. Otherwise you should code up your alternative possibilities and do bayesian model selection (or continuous model expansion) or at least independently fit your several models and do some posterior predictive checks, or something. Don’t just pretend that other models are ruled out when they aren’t.
    
    As for picking a prior after viewing the data. I’ve done this for computational efficiency reasons. Sometimes without a fairly strong prior Stan will run off to a pathological corner of the model space (look, if I change parameter A to 1 the likelihood is peaked at x=0 and then if I run parameter 2 out to infinity, all the data points fit into that little peak!). Putting strong priors on the parameters can prevent that. Typically I do this for debugging and then once the model fits, I go back and relax the priors until they more reasonably represent real prior information. Usually in the pathological cases I had originally just put in some prior without giving it much thought.
    
Carlos Ungil on June 21, 2017 1:54 PM at 1:54 pm said:

(I write this reply to Daniel at the top level because we went beyond the nesting limit a while ago. And for the record, that’s a paper by James K. Lindsey, not Dennis V. Lindley).

> my intuition about how stopping rules that are related to parameters such as unobserved pain in the feet of the surveyor

How is the pain in the feet of the surveyor dependent on the parameters of interest?

> if you do a Bayesian analysis after each data point comes in, and then you stop when the posterior probability of some parameter value drops low enough for some rule (or you hit a maximum number of observations)

That stopping rule is independent of the unknown parameter.

- Daniel Lakeland on June 21, 2017 4:18 PM at 4:18 pm said:
  
  Carlos: re authorship: Whoops
  
  Re pain in the feet: suppose model is
  
  AnswerToQuestion[i,j] ~ normal(PersonAvg[i]+QuestionerInducedBias[i],sigmaPerson);
  
  PersonAvg[i] ~ normal(PopulationAvg,PopSigma);
  
  QuestionerInducedBias[i] ~ normal(PainInFeetOfQuestioner(t),biassd);
  
  and we have a model for how PainInFeet… builds up over time, which is informed by N the number of people surveyed before the questioner gives up due to walking around in uncomfortable shoes.
  
  Now, conditional on data, PopulationAvg is not independent of the parameters that describe the PainInFeet function. The more we know about pain in feet, the more we can correct for its bias to the PopulationAvg. Therefore, by observing N the point at which the surveyor gave up, and having a model for how pain in the surveyor’s feet caused them to give up, we’ll get different bias corrections for different Ns.
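  
  A rough generative sketch of this model, under loudly stated assumptions that are not in the model above: one answer per person, pain that grows by a random exponential amount at each door, and a surveyor who quits once pain crosses a fixed threshold. All names and default values are hypothetical.
  
  ```python
  import random
  
  def simulate_survey(population_avg, pain_per_door, rng,
                      pop_sigma=1.0, sigma_person=0.5, bias_sd=0.1,
                      quit_threshold=5.0, max_doors=1000):
      """Knock on doors until accumulated foot pain crosses quit_threshold.
  
      Returns the list of recorded answers; its length is the observed N.
      """
      answers = []
      pain = 0.0
      while pain < quit_threshold and len(answers) < max_doors:
          person_avg = rng.gauss(population_avg, pop_sigma)
          # Questioner-induced bias is centered on the current pain level.
          bias = rng.gauss(pain, bias_sd)
          answers.append(rng.gauss(person_avg + bias, sigma_person))
          # Assumed pain dynamics: a random increment with mean pain_per_door.
          pain += rng.expovariate(1.0 / pain_per_door)
      return answers
  
  def mean_doors(pain_per_door, reps=200, seed=0):
      """Average observed N over repeated simulated surveys."""
      rng = random.Random(seed)
      total = sum(len(simulate_survey(0.0, pain_per_door, rng)) for _ in range(reps))
      return total / reps
  
  print(mean_doors(0.1), mean_doors(1.0))
  ```
  
  In this toy version a small observed N points to fast pain build-up, and therefore to a different bias correction than a large N would.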
  
  ———–
  
  Bayesian analysis after each data point comes in case:
  
  Suppose we’re trying to rule out par=0. We do Bayesian test after each data point. If par is far from zero, we detect this early and stop early, if par is nearer to zero, we need much bigger N, so we stop much later. How is that independent?
  
  - Carlos Ungil on June 22, 2017 12:31 AM at 12:31 am said:
    
    For the “pain in the feet” stopping rule my question was based on the quote “Run a set of surveys knocking on successive doors until your feet hurt or you get discouraged vs knock on fixed 100 doors.”. I forgot that a bit later you assumed that it also had a direct effect on the outcome. I’m sorry I missed that, it’s hard to keep track of the goalposts.
    
    For the second scenario, is the stopping rule based just on the data? This is what it seems to me: parameters -> data -> interim analysis -> stopping decision. In that case, how is the stopping rule dependent on the parameters conditional on the data?
    
    - Daniel Lakeland on June 22, 2017 12:50 PM at 12:50 pm said:
      
      Carlos, sorry for the goalpost movement, I’m trying to show that there are lots of real world considerations that look a lot different from what you’d get in some kind of textbook situation.
      
      I think in the real world, people who don’t have a pre-determined sample size tend to stop their sampling for all kinds of “real world” reasons like they got tired, they ran out of the reagent, the person they hired to do the work quit for a better job, the person they hired to do the work got sick, the people who were recruited to the study found it extremely uncomfortable, blablabla and often there is some feedback on the outcome… you run out of reagent because the experimenter was sloppy and tended to spill it, the worker quit for a better job because they hated the boss, and because they spent so much time looking for a new job, they did poorer quality work, the recruits found it uncomfortable because the side-effects were pretty severe, and thus there was really no blinding, and there was differential drop-out in the control vs experiment group… etc etc. In the real world when you see a design where N is not pre-determined you should be looking for the causes of the N, because usually it’s not “sample until you get 3 in a row” it’s “do some stuff, and then something happens, and then decide to cut the study short” and the reason why you cut the study short is often informative about what was going on.
    - Carlos Ungil on June 22, 2017 3:08 PM at 3:08 pm said:
      
      Nobody forces me to keep running towards the new goals, so I can’t really complain. Nevertheless, it can be frustrating that one of your standard responses to many arguments is “ok, maybe, but what about this more complex example I’m just making up”. I’m sure everybody is aware of the fact that any model can be made more complex.
      
      It’s still useful to understand simple textbook models. It’s useful to understand why the standard position of bayesians about stopping rules is “they don’t matter” (if you remain unconvinced, see for example http://www.ejwagenmakers.com/2007/StoppingRuleAppendix.pdf ). You will still be able to invent contrived examples where this is not true, and you will know exactly how contrived they need to be!
      
      As Corey said, stopping rules are a favorite example of both frequentists and Bayesians to show how the methods of the other camp lead to absurd conclusions.
    - Carlos Ungil on June 22, 2017 3:17 PM at 3:17 pm said:
      
      Another relevant paper (simulation based, because everything gets better when you add RNGs):
      http://pcl.missouri.edu/sites/default/files/Rouder-PBR-2014.pdf
    - Daniel Lakeland on June 22, 2017 3:48 PM at 3:48 pm said:
      
      Of course I’ve heard the “stopping rules don’t matter” before, and I am aware that there are examples where stopping rules don’t matter, and I believe the flipping the coin and stopping after three heads leads to eventually the same likelihood after you plug in the data as if you had pre-specified the N (but we both agree it leads to different data generating processes pre-data!)
      
      But to be honest, I tend to work in scenarios like where one of my friends comes to me and they’ve got a really complicated biological experiment that’s been carried out over 2.5 years by 4 or 5 different people, and by the time I’m done talking to them about the situation involving how they first started measuring things with instrument 1 and then the guy who ran that instrument quit, and so they moved to instrument 2, and then they couldn’t source reagent q anymore so the second half of the experiments were done with reagent z and blablabla, and deep into a half hour discussion of how the experiment was run they tell me things like “it seemed like we had a pretty consistent result, so we stopped collecting data”
      
      it’s not so convincing to me to just repeat the mantra “stopping rules don’t matter” and move on.
      
      And that kind of thing has been basically bread and butter over the last few years. I have these conversations multiple times a year.
      
      “it seemed pretty consistent so we stopped” is that a random stopping rule or not? Is it dependent on the parameters of interest or not?
    - Carlos Ungil on June 22, 2017 3:57 PM at 3:57 pm said:
      
      How is that dependent on the parameters of interest ****** CONDITIONAL ON THE DATA ****** ?
    - Daniel Lakeland on June 22, 2017 2:05 PM at 2:05 pm said:
      
      also, in the second scenario, the stopping rule is not just based on the data, it’s based on the data, and the inference *from a specific model*. So then if you decide to reanalyze the data in the context of a different model, you need to consider the fact that the stopping rule under the old model may be statistically dependent on some new parameter in the new model.
      
      Also, question for you regarding interpretation, since maybe you have some more experience with this question than I do. When in the UC Berkeley paper they say: “A stopping rule, given data, is Informative relative to parameters of interest if it is random and statistically dependent on those parameters.”
      
      So, we collect some data and each time we do an analysis to get a posterior distribution of parameter p, adding one more data point will produce a “random” new posterior. As soon as that posterior “rules out” p = 0 sufficiently, we stop, so probabilistically speaking it certainly seems that the stopping rule is dependent on p, in the sense that Pr(N = n | p) is a function of p.
      
      I have problems translating all of this into the language of Cox/Bayes because the Berkeley original seems to be pretty firmly in the Probability = Frequency interpretation. But I think you can potentially rewrite it as something like “partial knowledge of the parameter of interest alters your knowledge of the stopping point”
      
      But, then, that would seem to apply to the “stop after 3 heads” case as well. If p is very small, then you expect to flip many times, and if p is large you expect to flip few times. But I think you’ve convinced me that this results in the same likelihood. Is that just an accident of the bernoulli/negative binomial symmetry? Possibly. Or possibly we need a more carefully constructed definition of “informative stopping rule”
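      
      One hedged way to write the general structure, assuming the observations are iid given θ; the notation s_i is introduced here for illustration and is not taken from the linked papers. Let s_i(y_{1:i}, θ) be the probability of stopping right after the i-th observation, given the data so far. The joint probability of the data and of stopping at n is then:
      
      ```latex
      p(y_{1:n}, N = n \mid \theta)
        = \left[ \prod_{i=1}^{n} p(y_i \mid \theta) \right]
          \left[ \prod_{i=1}^{n-1} \bigl( 1 - s_i(y_{1:i}, \theta) \bigr) \right]
          s_n(y_{1:n}, \theta)
      ```
      
      If every s_i depends only on the data so far and not on θ, the last two factors are constant in θ and the likelihood is proportional to the first bracket alone. “Stop after 3 heads” is that case with each s_i equal to 0 or 1, which suggests the agreement is not special to the Bernoulli/negative binomial pair.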
    - Carlos Ungil on June 22, 2017 2:50 PM at 2:50 pm said:
      
      Or maybe we just need to read carefully the existing definition of informative stopping rule?
      
      A stopping rule, = “stop after three heads”
      given data, = “+ + – – – + + – + + +”
      is informative relative to parameters of interest if it is random and statistically dependent on those parameters. = “the rule says STOP, which is not random (let alone dependent on the parameter)”
      
      Given data ” [+ – +] [+ – -] [- – -] [- + +] [- + -] [+ + -] [+ + +] [- + -] [+ + +] [+ + +] [+ + +] ” the stopping rule
    - Carlos Ungil on June 22, 2017 2:52 PM at 2:52 pm said:
      
      (Please ignore the last line in the previous comment.)
    - Daniel Lakeland on June 22, 2017 3:39 PM at 3:39 pm said:
      
      Carlos: one of the great things about discussing this stuff with you is that you follow along with the goalposts. I’ve learned a lot by you pushing me to be more precise or to understand something more carefully, and I hope that the secret lurkers who read this stuff also have, so thank you! (and lurkers are free to chime in with +1 here)
      
      As for the definition of the stopping rule: see I have difficulty with the term Random. I think lots of people go along with the word Random without thinking too much about it. Let’s use for the definition of random that it does not have probability 1 or 0. But we’re Bayesian, so we have to ask at what point in time and for whom or more generally, conditional on what knowledge?
      
      Our stopping rule is “stop after 3 heads in a row”.
      
      I am about to start flipping coins, what is the probability I will stop after the 3rd as of this moment calculated by me? We both agree there is zero probability for me to stop after 1 or 2. What knowledge is this conditional on? Assuming I know p = 1/2 I could say the probability I will stop after the 3rd is 1/2^3 but if I don’t know p then all I can say is p^3 for some unknown p and I can put a prior on it … etc. So, at the start of the whole thing the stopping rule is random and dependent on p.
      
      Now I get h,h,t
      
      what is the probability I will stop after 3 at this point in time? the probability is zero at this point, because I know I don’t have h,h,h
      
      Let’s change the rule to illustrate further confusion: after getting 3 heads in a row, ask Joe if you should stop, and if he says yes, stop.
      
      I now get h,h,h what is the probability I will stop as calculated by you? what is the probability I will stop as calculated by Joe?
      
      Or if you like, after 3 h values, call rbinom(1,0.5) and stop if it’s 1, what is the probability of stopping if you know the RNG seed, what is the probability of stopping if you don’t know the seed?
      
      So, to me, “random” is extremely problematic. It’s this feature of the definition that I find confusing.
      
      So let’s go back to the definition, when I read “random” I assumed it meant “not determinable at the *start* of the experiment” whereas you are reading it as “not determinable at the point in time when the stopping decision is made” and it’s clear that there are rules such as the RNG seed or Joe example where for some people, the rule is random, and for some people the rule isn’t random…
      
      I have problems with understanding that.
    - Carlos Ungil on June 22, 2017 3:54 PM at 3:54 pm said:
      
      It seems to me that your problem is with “given”. The question is whether the stopping rule is random given the data. If the rule is “stop after three heads” then it isn’t random. Period.
      
      If it happens to be random, the second condition for the stopping rule to be informative is that it has to be correlated with the parameters of interest. If the parameter of interest is theta, stopping depending on a random rbinom(1, 0.5) will not be informative. Stopping depending on the output of rbinom(1, theta) would be informative.
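      
      A sketch of that contrast, under assumptions of my own choosing: after every flip a separate “stopping coin” is tossed, and the experiment ends the first time it comes up 1. The flips are made-up data and the function names are hypothetical. The full likelihood multiplies the usual Bernoulli term by the probability of the observed continue/stop decisions.
      
      ```python
      def bernoulli_lik(theta, flips):
          """Likelihood of the 0/1 flips alone, ignoring how the experiment stopped."""
          k = sum(flips)
          n = len(flips)
          return theta**k * (1 - theta)**(n - k)
      
      def lik_with_random_stop(theta, flips, stop_prob):
          """Include n-1 'continue' decisions and one final 'stop' decision.
      
          stop_prob maps theta to the per-flip stopping probability.
          """
          n = len(flips)
          s = stop_prob(theta)
          return bernoulli_lik(theta, flips) * (1 - s)**(n - 1) * s
      
      flips = [0, 1, 0, 0, 0, 1, 0, 0]  # made-up data: 2 heads in 8 flips
      grid = [i / 100 for i in range(1, 100)]
      
      def grid_mle(lik):
          """Grid value of theta that maximizes the given likelihood."""
          return max(grid, key=lik)
      
      mle_plain = grid_mle(lambda t: bernoulli_lik(t, flips))
      mle_coin = grid_mle(lambda t: lik_with_random_stop(t, flips, lambda th: 0.5))
      mle_theta = grid_mle(lambda t: lik_with_random_stop(t, flips, lambda th: th))
      print(mle_plain, mle_coin, mle_theta)
      ```
      
      The rbinom(1, 0.5) rule only rescales the likelihood, so the estimate is unchanged; the rbinom(1, theta) rule adds a factor that depends on theta and moves it.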
    - crh on June 22, 2017 4:06 PM at 4:06 pm said:
      
      > (and lurkers are free to chime in with +1 here)
      
      +1
    - Daniel Lakeland on June 22, 2017 5:01 PM at 5:01 pm said:
      
      Good, ok that helps I think. So if p(Stop | Data_so_Far, Model) = 0 or 1 we can call it nonrandom (under that model) and therefore not informative (under that model).
      
      Now, let’s go back to my example where you do a bayesian pre-analysis and then if you can reject q=0 for some level of probability you stop. You might argue that as you get your data, and do your bayesian pre-analysis before deciding to stop, the Bayesian analysis will, conditional on data, give you a deterministic value for the posterior distribution of the parameter and so if you stop when Pr(q = 0) < 10^-3 this is deterministic given the data, and the model. (if you do your analysis in Stan, there's always MCMC error too… but we can make this small).
      
      But after you've stopped, and a few weeks later you discover there was some issue you hadn't considered, say like a drifting measurement error in some instrument, or a consistent bias in the polling methods that favored Hillary, or whatever, then
      
      p(Stop|Data_so_far,Model2,additionalParameter2) does vary depending on the additionalParameter2 (say the size of the bias correction)
      
      Is the argument then that the rule "calculate p(q| Data_so_far,Model1) and stop when it's small or you've sampled at least N" is deterministic given the data and so it's not under Model2 of any relevance anymore?
      
      How about if you modify the rule to be "generate 10 random qs from the posterior under Model1 and if none are less than 0.1 stop"
      
      This seems to me to be very relevant to the design of stopping experiments. It's safe to consider a Bayesian stopping rule as dependent only on the data *and the model* and not the parameter, provided you never ever want to use a different model to analyze this data. This is a kind of risk you take by deciding to use a Bayesian stopping rule. The stopping rule may make re-analyzing your data using a different model more complicated.
      
      This case where we're assessing the posterior using simulation seems interesting. The posterior density for q is deterministic given the data, but the q* value itself isn't, and clearly the q* value is correlated to q_real (that's why we got the posterior in the first place)
      
      If we go back to "it looked pretty consistent so we stopped" is this not like "get a kind of gut instinct point estimate of some parameter, then make a random decision about whether to continue or stop based on what that point estimate was" ? which is sort of along the same lines as "randomly generate a value q* from the posterior, and then rbinom(1,q*) decides your stopping"
      
      This kind of thing is shockingly common, where a vague idea that the data looks ok based on some expectation of what you're going to get leads you to randomly stop after you've fulfilled that expectation in some imprecise sense.
    - Daniel Lakeland on June 22, 2017 6:12 PM at 6:12 pm said:
      
      ‘How about if you modify the rule to be “generate 10 random qs from the posterior under Model1 and if none are less than 0.1 stop”‘
      
      See I like this example a lot, because if we assume that in essentially all real world cases, we’re estimating probabilities via sampling, then a rule like “if p(q<0.1) < 0.1 stop” which is deterministic in the case of perfect symbolic calculation, is random without knowledge of the RNG seed in the case of real world calculation (but deterministic if you know the seed).
      
      If you later want to analyze this under some more realistic model involving more factors you’re in a weird position, you’ll be asking:
      
      “what is the probability under model 2 that when you calculate probability of q under model 1 using Stan, you will stop, and is this independent of the value of all the parameters *in model 2*”
      
      How about this for a stopping rule:
      
      Flip your coin, if you get 3 heads in a row, flip your coin again and if it’s a head stop.
      
      clearly that final flip is like rbinom(1,p) and the definition says if the decision rule is random given the data and dependent on p then it matters… but this decision rule is random given the data (up to the 3 heads in a row) and dependent on p, but produces exactly the same decisions as “if you get 4 heads in a row stop” which is not.
      
      Weird.
    - Daniel Lakeland on June 22, 2017 6:23 PM at 6:23 pm said:
      
      Aha, no I get it. In the case where you see 3 heads, then flip another coin and if it’s heads stop. You can immediately infer that the last coin flip was heads even if it’s not recorded so if you have
      
      ththhh
      
      your likelihood is actually based on the sequence
      
      ththhhh
      
      which is the same as 4 in a row deterministic stopping.
      
      which just goes to show that in fact the hhh + flip stopping rule… is informative
      
      I’m still not sure how to deal with the idea of analyzing a dataset under model2 which had a stopping rule based on model1 and random samples from the model1 posterior.
      
      good times though.
    - Andrew on June 22, 2017 6:25 PM at 6:25 pm said:
      
      Hi—I won’t jump into this discussion except to recommend that youall read chapters 6 and 8 of BDA which address some of these issues, including the partial but not complete relevance of the likelihood principle to Bayesian data analysis.
    - Daniel Lakeland on June 22, 2017 6:51 PM at 6:51 pm said:
      
      Andrew: thanks, I’ll look there.
      
      Carlos: Here’s where I think we can both agree. We agree that in the presence of a stopping rule, the data generating process is different (this is what you said about how the likelihood as a function of abstract symbolic quantities is different, it has to be capable of handling different length vectors etc). After we see data and plug it in, it may or may not be the case that a model which assumes a fixed N equal to the observed N, and a model which models the data generating process directly will arrive at the same likelihood function over the parameters. In many cases, it will. but *you won’t go wrong by modeling the data generating process* whereas under informative stopping rules you *will* go wrong by ignoring the stopping rule and assuming fixed N.
      
      Is that all fair to say? I think this much we agree on. I admit to confusion about how to evaluate whether a stopping rule is informative by the “random and probabilistically dependent on the parameter conditional on the data” definition because I’ve hyper-internalized the cox/jaynes conception of all probabilities being conditional on some background information, and so without specifying what background information we are using, I don’t have a good idea what it means for a stopping rule to be “random”, hence things like the PRNG seed suddenly becoming relevant to the question or whether a rule can be considered informative under one model and uninformative under another model that has different parameters. In many of these textbook examples, the exactness of the data generating process is assumed, ie “flip a coin that has a constant p” is *really true* about the world, whereas in the “here’s some data, here’s what we did, can you model it” scenario, there are competing models and all of them are known to be not literally true. The parameter is not a feature of the world, it’s a feature in our head that helps us explain the world.
    - Carlos Ungil on June 23, 2017 3:47 AM at 3:47 am said:
      
      > Is the argument then that the rule “calculate p(q| Data_so_far,Model1) and stop when it’s small or you’ve sampled at least N” is deterministic given the data and so it’s not under Model2 of any relevance anymore?
      
      In principle, I don’t see the problem. Under model 1, is the stopping rule that was applied correlated with the parameters? If not, it’s not informative. Under model 2, is the stopping rule that was applied correlated with the parameters? If not, it’s not informative. Of course if you don’t know the answer to that question you won’t know whether it’s informative or not. But if you cannot get the answer to that question you won’t be able to write a more comprehensive model either…
      
      > This seems to me to be very relevant to the design of stopping experiments. It’s safe to consider a Bayesian stopping rule as dependent only on the data *and the model* and not the parameter, provided you never ever want to use a different model to analyze this data. This is a kind of risk you take by deciding to use a Bayesian stopping rule. The stopping rule may make re-analyzing your data using a different model more complicated.
      
      A stopping rule that depends only on the data *and the model*, as you say, depends only on the data. The model in the stopping rule is not a random variable, so I think the *and the model* qualification makes no sense. If you want to analyse the data with another model, a stopping rule that was deterministic will still be deterministic, so it cannot be correlated to the parameters in the new model.
      
      Of course if the stopping rule is not deterministic given the data it might be unrelated to the parameters in model 1 but be related to the parameters in model 2. Do your “10 random qs” depend on the parameters of model 2? (See the temperature-mediated example of correlation between the stopping rule and model parameters above).
      
      > the q* value is correlated to q_real (that’s why we got the posterior in the first place)
      
      The q* value is not correlated to q_real ****** CONDITIONAL ON THE DATA ******.
      
      > If we go back to “it looked pretty consistent so we stopped” is this not like “get a kind of gut instinct point estimate of some parameter, then make a random decision about whether to continue or stop based on what that point estimate was” ? which is sort of along the same lines as “randomly generate a value q* from the posterior, and then rbinom(1,q*) decides your stopping”
      
      Is that decision correlated to the model parameters ****** CONDITIONAL ON THE DATA ******? Otherwise, it’s non-informative.
    - Carlos Ungil on June 23, 2017 4:03 AM at 4:03 am said:
      
      Somehow I lost the first paragraph in my previous message. Here it is again:
      
      > the Bayesian analysis will, conditional on data, give you a deterministic value for the posterior distribution of the parameter and so if you stop when Pr(q = 0) < 10^-3 this is deterministic given the data, and the model. (if you do your analysis in Stan, there's always MCMC error too… but we can make this small).
      
      Even if the Stan analysis is non-deterministic, this non-determinism will be irrelevant unless it's correlated with the parameter. (No need to tell me that it could happen, I can figure out an example myself: let's say we have a textbook coin-flipping example using a bimetallic coin, so the parameter theta changes with the temperature, and let's say the analysis is performed using a computer known to malfunction when the temperature increases, biasing the response).
  - Corey on June 22, 2017 9:16 AM at 9:16 am said:
    
    “Suppose we’re trying to rule out par=0. We do Bayesian test after each data point. If par is far from zero, we detect this early and stop early, if par is nearer to zero, we need much bigger N, so we stop much later. How is that independent?”
    
    For extra fun times: even if it’s true that par=0, if we sample long enough we can put 0 arbitrarily far into the tails of the posterior distribution. This is called “sampling to a foregone conclusion”; the theorem is the law of the iterated logarithm.
    
    - Corey on June 22, 2017 9:27 AM at 9:27 am said:
      
      (Er, with a flat prior.)
    - Daniel Lakeland on June 22, 2017 2:23 PM at 2:23 pm said:
      
      Yes, that may be true, but that’s the kind of statistical vs practical significance problem. In the case where you sample until you rule out zero, you will find that if p = 0 the high probability region you have by the point you finally rule it out is also extremely close to zero. If you define a practical equivalence to zero, such as p = 0 +- 0.1 you will not be able to rule out this possibility unless p really is of the order 0.1 (asserted without proof ;-) )
Anoneuoid on June 23, 2017 12:42 AM at 12:42 am said:

Corey wrote (http://statmodeling.stat.columbia.edu/2017/06/18/dont-say-improper-prior-say-non-generative-model/#comment-512001):

> Bayesians love this one because you can make the stopping rule contingent on all sorts of things the experimenter might not actually be aware of. (For example, a samurai is hiding nearby and plans to swoop in and slice the coin in half with his sword on the second toss, but you stop on the first and he never gets a chance. Should this affect your inference?)

Not sure if I got your point but Bayes’ rule for N hypotheses labeled 0:N, given evidence E is:
P(H[0]|E) = P(H[0])*P(E|H[0])/sum( P(H[0:N])*P(E|H[0:N]) )

All hypotheses with relatively low P(H[i])*P(E|H[i]) can be dropped from the denominator as an approximation. It is just like x ~ x + 1/inf, which everyone accepts.
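
The approximation is easy to check numerically. Here is a minimal sketch in Python; the priors and likelihoods are made-up illustrative numbers (not taken from any example in this thread), and the `posterior` helper is a hypothetical name introduced only for this illustration:

```python
def posterior(priors, likelihoods, index=0):
    """P(H[index] | E) from Bayes' rule over a finite list of hypotheses."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    return joint[index] / sum(joint)


# Illustrative numbers only: the third hypothesis has a tiny prior
# probability, so P(H[2]) * P(E | H[2]) is negligible.
priors = [0.5, 0.5 - 1e-9, 1e-9]
likelihoods = [0.2, 0.05, 0.9]

# Full denominator, summing over all three hypotheses.
full = posterior(priors, likelihoods)

# Approximate denominator, with the negligible hypothesis dropped.
approx = posterior(priors[:2], likelihoods[:2])

print(full, approx, abs(full - approx))
```

With these numbers both versions give a posterior of roughly 0.8 for H[0], and the two differ only far past the decimal places anyone would report.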

Anyway this is all fiddling while Rome burns as long as most people are testing “some other hypothesis” rather than their hypothesis.

- Corey on June 23, 2017 1:36 PM at 1:36 pm said:
  
  You definitely didn’t get the point of the scenario, probably because I didn’t actually spell it out in any depth (if it interests you, you could read the section of David MacKay’s book linked above). My point was only that it’s a popular example among Bayesians and so it surprised me that Daniel didn’t recognize it in the course of his discussion with Carlos. In any event, my mental model of you is failing; I can’t suss out why you’d respond by giving me that approximation.
  
  …oh no, wait, I get it — the samurai scenario has negligible probability. Right, so that’s actually beside the point that I neglected to explain.
  
  - Anoneuoid on June 24, 2017 10:15 AM at 10:15 am said:
    
    After reading that I’m thinking why not use a more real life example? Say someone keeps collecting data (flipping coins) until they either get p < alpha or run out of money, vs someone who stops after a planned n. Further alpha is adjusted for each line of work depending on how much it costs to collect each datapoint. I mean that is what over 90% of researchers are actually doing…
    
    - Corey on June 25, 2017 3:55 PM at 3:55 pm said:
      
      A didactic example has to satisfy a number of criteria, and one of them is simplicity of the math when the math is beside the point.
    - Daniel Lakeland on June 25, 2017 5:24 PM at 5:24 pm said:
      
      Someone should have given me the memo
    - Carlos Ungil on June 25, 2017 5:39 PM at 5:39 pm said:
      
      Indeed.
Daniel Lakeland on June 23, 2017 1:18 PM at 1:18 pm said:

Carlos: not sure if you’re still paying attention here. but here’s the gist of my biggest concern:

To me, a stopping rule is deterministic, if, given all the information I have about it, I can assign 0 or 1 as the probability of stopping given the data, ie. p(Stop | Data, WhatIKnow) in {0,1}

There’s no other meaningful way to handle this in my Cox/Jaynes Bayesian conception, and so for some people they’ll call a rule deterministic and for some people they won’t, and it isn’t a function of the rule, it’s a function of what they know about the whole process.

So, the census goes out and surveys some region with 1000 census tracts. They publish a data set, and some notes on data collection. In the notes on data collection they say “we used the Proprietary SuperEnsemble(tm) method developed by our partners at Booz Allen Hamilton (motto: we haven’t been indicted… yet!) to stop sampling when we were virtually guaranteed to have sampled at least 10% of all people in this region”

Now, given what you and I know about the SuperEnsemble method, is it deterministic or “random”?

Imagine an alternative world, instead the census says “we did a preliminary study and performed a design calculation that determined that we should sample 3000 households” is this a “fixed N” or is it random? What if their preliminary study was to survey 1 census tract and calculate several sample statistics and do some mathematical calculations, and then round off, and the whole thing spit out 3000 ? Since we’ll assume the preliminary study wasn’t part of the final data set, then as far as I’m concerned, selection of this number 3000 was a random process that was dependent on the parameters of interest (because it was calculated from a sample from the population that isn’t part of the data). So even “sample 3000” can secretly be a random and parameter dependent rule.

Finally, assuming I can’t assign 0,1 to the stopping rule given the information I have, at least maybe I can determine if the stopping rule is dependent on the parameters of interest? Well, not until I have a model that tells me what the parameters are. Once I have that, then I need to do the best I can to guess about the stopping rule, and determine if given my knowledge, I might have assigned different probabilities of stopping if I were given the actual values of the parameters in my model than if I weren’t.

Of course, if you don’t take this Cox/Jaynes view, you might have a very different idea of what it means to be random vs deterministic.

My solution to all of this is to always model the generating process as best as I understand it. I don’t think I’ll go wrong with this. I’ll get the inference that corresponds to what I *really think* about the process. Whereas, if I repeat the mantra “stopping rules don’t matter” and I don’t do this full analysis of the generating process, then I’ll sometimes get the same thing that I would have, and sometimes I won’t.
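
A minimal sketch of that last point in Python, reusing the coin examples from earlier in this thread. It assumes a uniform prior on theta and a simple grid approximation; the function names and the particular counts (3 heads in 12 flips) are hypothetical choices made only for illustration:

```python
from math import comb


def posterior_mean(likelihood, grid_size=2001):
    """Posterior mean of theta on a grid, under a uniform prior on [0, 1]."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    weights = [likelihood(t) for t in grid]
    return sum(t * w for t, w in zip(grid, weights)) / sum(weights)


# Case 1: 3 heads seen in 12 flips, under two different designs.
def fixed_n(t):
    # N = 12 chosen in advance (binomial likelihood).
    return comb(12, 3) * t**3 * (1 - t)**9


def stop_at_third_head(t):
    # Flip until the 3rd head, which happened to arrive on flip 12.
    return comb(11, 2) * t**3 * (1 - t)**9


# Case 2: recorded sequence "ththhh" under the rule "after 3 heads in a
# row, flip once more without recording it, and stop if it is a head".
def ignore_rule(t):
    # Pretend N = 6 was fixed: 4 heads, 2 tails.
    return t**4 * (1 - t)**2


def model_rule(t):
    # Stopping implies the unrecorded flip was a head: 5 heads, 2 tails.
    return t**5 * (1 - t)**2


same_a = posterior_mean(fixed_n)
same_b = posterior_mean(stop_at_third_head)
naive = posterior_mean(ignore_rule)
modeled = posterior_mean(model_rule)

print(same_a, same_b)   # likelihoods differ only by a constant factor
print(naive, modeled)   # likelihoods differ in their dependence on theta
```

In the first case the two likelihoods are proportional, so the posterior is the same whichever design you write down. In the second case the stopping event carries an extra factor of theta, and assuming fixed N gives a different answer from modeling the process.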

- Carlos Ungil on June 23, 2017 1:39 PM at 1:39 pm said:
  
  Have you understood my other replies or not? I have no interest in discussing metaphysics, really. I’m just interested in the questions with a practical relevance that you seemed to struggle with.
  
  - Daniel Lakeland on June 23, 2017 2:03 PM at 2:03 pm said:
    
    Carlos: yes I understand that if you tell me detailed textbook type information about the stopping rule, I can then decide whether it’s deterministic or random, and if random whether it depends on the parameter given the data or not. At least I have a hope of being able to do that.
    
    What I still don’t know is whether any of that is helpful to me in applied problems where the stopping rule is vague, or may or may not involve unknown quantities from the population (like the “hidden” preliminary design study in the Census example above)
    
    These are really the cases that I actually run into. no one actually does textbook problems like “sample until you get 3 heads in a row” they all do things like “we first looked at XYZ and noticed that it had certain properties, and based on that, and cost considerations, we decided to sample N items, and after we did that we looked at the data and then we decided to sample K additional items in a certain sub-group… ” or whatever. And in these cases I think the “metaphysics” winds up mattering because it alters your data generating process, and whether that results in the same likelihood as if you’d made some alternative assumption, isn’t as clear cut and guaranteed as the textbook type example.
    
    BUT: YES please accept my thanks for helping to clarify how to analyze the textbook cases at least.
    
    - Carlos Ungil on June 23, 2017 2:47 PM at 2:47 pm said:
      
      A few comments ago you wrote:
      
      > If we go back to “it looked pretty consistent so we stopped” is this not like “get a kind of gut instinct point estimate of some parameter, then make a random decision about whether to continue or stop based on what that point estimate was” ? which is sort of along the same lines as “randomly generate a value q* from the posterior, and then rbinom(1,q*) decides your stopping”
      > This kind of thing is shockingly common, where a vague idea that the data looks ok based on some expectation of what you’re going to get leads you to randomly stop after you’ve fulfilled that expectation in some imprecise sense.
      
      Is this a textbook example now? Do we agree that if the stopping rule is not correlated to the parameters in the model it won’t be informative? Why was that a very important problem yesterday and it’s not representative of the real world issues today? It would help with the discussion if you were able to stay with one of your own examples for more than 30 seconds.
      
      In the “hidden” preliminary design study, the stopping rule is not informative because it’s fixed before the study and deterministic. I think the problem here is that your prior is not consistent with your knowledge if the N=3000 reflects some information that is not included in the prior. You could also think of the same study as beginning slightly before, with a stopping rule which is not fixed but established through a parallel survey. In that case it’s clear that the stopping rule is informative if that additional data is not included in the analysis.
      
      > no one actually does textbook problems like “sample until you get 3 heads in a row” they all do things like “we first looked at XYZ and noticed that it had certain properties, and based on that, and cost considerations, we decided to sample N items, and after we did that we looked at the data and then we decided to sample K additional items in a certain sub-group… ” or whatever.
      
      Again, is that stopping rule correlated to the parameters conditional on the data or not? (I hope the stars and capital letters are not required anymore)
      If you can’t tell, you don’t know if the stopping rule is informative. Your solution to all of this is to always model the generating process as best as you understand it, ok. But if your understanding doesn’t allow you to decide whether the stopping rule is informative or not, I don’t see how it can be a solution.
      
      > maybe I can determine if the stopping rule is dependent on the parameters of interest? Well, not until I have a model that tells me what the parameters are.
      
      Do you find that surprising? You cannot do much hard core Bayesian analysis until you have a model either.
    - Daniel Lakeland on June 23, 2017 2:59 PM at 2:59 pm said:
      
      “Is this a textbook example now? Do we agree that if the stopping rule is not correlated to the parameters in the model it won’t be informative? ”
      
      I think my point is that whether the stopping rule is random or not, and whether it’s correlated with the parameters of interest or not is *not a property of the stopping rule*
      
      it’s a property of your model of the data generating process.
      
      But, the goal of a definition like “a stopping rule which is not random, or is random and not correlated with the parameter of interest is uninformative” is to give you a method for quickly deciding whether to ignore the stopping rule.
      
      “we sampled 20 rooms by random number generator” sure sounds like “deterministic stopping rule N=20” but when you learn how the negotiations were done to arrive at the number 20 it sounds like “random number vaguely associated with cost to repair”
      
      and so, the heuristic of “first determine whether the stopping rule is random and probabilistically related to the parameter of interest, and then if it isn’t ignore it” is really no help at all. it is really equivalent to:
      
      “first create the full generative model, including a generative model for N (which by the way is how we got onto this whole thread in the first place, Andrew talked about needing a model for N) and then analyze it by ignoring the definition of informative stopping rules because they won’t be relevant when you have the right generating process”
      
      By the time you’ve done all this work including modeling how N came about… you’re going to get the right answer and you’re not going to be able to answer the question of “informative or not?” until you do all that work anyway.
    - Daniel Lakeland on June 23, 2017 3:19 PM at 3:19 pm said:
      
      And, get this @Corey: if you do model the data generating process in such a way that you have different N possible, and you model the fact that N is chosen in some sense based on what’s expected about some internal parameters to achieve some organizational goal etc, then in fact you will get priors for things like cost that depend on N
      
      p(CostToRepair[5] | N[5]) p(N[5])
      
      because “the reason people chose N[5] is because they knew that the cost to repair things that had a visual score of 5 was probably pretty damn high”
      
      so in fact, yes, it could be very reasonable to use N to adjust your priors on other internal parameters in some models and this is particularly true when you have a kind of adversarial ulterior motive perception of the study, like for example in a drug approval setting or a clinical trial setting with authors who have a demonstrated bias in their prior papers, etc etc if they’re choosing N or censoring data points, or whatever for an ulterior motive then even their “we excluded everyone with systolic BP > 150” or “we randomized the full recruited population 2:1 into treatment vs control” could tell you something even though it looks to be deterministic at the surface.
    - Carlos Ungil on June 23, 2017 3:22 PM at 3:22 pm said:
      
      Of course the stopping rule (which is a property of the experimental design) will or won’t be informative in the context of the model used for the analysis of the data (which is not a property of the experimental design). We fully agree on that.
      
      If a non-informative rule (fixed at the time when the experiment is designed) is based on prior data, the easiest way to proceed may be to incorporate that prior data into your prior distribution and use a model with N fixed. But of course you run into the risk of using that data twice in that case!
      
      Is the data used to choose N completely included in the data you will analyse later? Then you can treat N as fixed without changing the prior.
      
      Is the data used to choose N completely absent from the data you will analyse later? Then you can treat N as fixed if you include that information in the prior.
      
      Is the data used to choose N partially included in the data you will analyse later? Then you could treat N as fixed but you have to “partially” modify the prior (difficult) or you could create the full generative model accounting for the partial overlap (not any less difficult, and I would say more difficult).
    - Daniel Lakeland on June 23, 2017 3:50 PM at 3:50 pm said:
      
      Good, perfect, now I think we’re in agreement.
      
      I think my confusion was thinking that somehow knowing the definition of “informative stopping rule” would allow an analyst to short-circuit certain difficulties in modeling, as in “oh, in this case I’ve got an uninformative stopping rule, so I can pretend N was fixed”.
      
      If “uninformative stopping rule” was a true property of the rule like “does or does not contain quantities of lead measurable with instrument X” is a real physical property of baby food… then you could first figure out what the “truth” was, and then use that truth to select your model.
      
      But, it doesn’t work like that. It’s just a classification system for after the fact. After you are in a position to determine whether the inference depends on the details of the stopping rule and/or choice of N, you can classify yourself into “stopping rule was informative” or “stopping rule wasn’t informative”. Useless before doing models, but after doing models you can see that what you did was basically one or the other.
      
      I’m actually really happy with the idea that one way to deal with all of this is to look at the N and what you know about how it came about, and adjust your priors on parameters to explicitly account for what you think the N might tell you. That seems like a pragmatic useful rule of thumb, that it’s somehow legit to consider what caused the N in considering how to select the priors. I think that’s your 2nd and 3rd options.
      
      This has been really helpful to hammer out a very subtle idea that is not usually addressed at this level in textbooks etc. My discussion of “textbook” problems is meant to refer to where someone poses the problem in such a way that the property “informative or uninformative” *really is* decidable before hand because there is no ambiguity in the model that might make something informative to one person and uninformative to another. There is an unambiguous shared comprehensive set of knowledge thanks to the extremely precise statement of the problem (ie. “flipping a perfect bernoulli coin with constant p”)
      
      Once again Carlos, I really appreciate your persistence, even through what might have been some frustration. And I hope “crh” above and others get something out of this. In fact, I think I’ll put up a summary on my blog and point it at this thread.
    - Carlos Ungil on June 23, 2017 4:39 PM at 4:39 pm said:
      
      We also agree that the discussion has been interesting. However, I don’t understand why now you think that you cannot say “I’ve got an uninformative stopping rule, so I can pretend N was fixed” to simplify the modelling.
      
      In most of the cases we have discussed, from textbook problems like “stop when you got three heads in a row” to real world examples like “stop if the result at an interim analysis is statistically significant” or “it seemed like we had a pretty consistent result, so we stopped collecting data”, you thought that more complex models were required because the one with N fixed was not representing the data generation process properly.
      
      It seems useful to be able to use the simple model instead, knowing that it will result in the same likelihood function (and inference depends on data only through the likelihood). Of course that’s not true unless the dependence of the stopping rule on the parameters is only through the data, but that seems a reasonable assumption in many cases.
    - Daniel Lakeland on June 23, 2017 7:26 PM at 7:26 pm said:
      
      Carlos: I mean by you cannot say “I’ve got an uninformative stopping rule” until you know it’s uninformative, and that takes doing a fair amount of thinking about what might cause the stopping rule to be informative and whether those things are things that you think you need in your model. In the textbook case you are handed the model “the coin is a bernoulli coin with constant p and the stopping rule is FOO” here there is really no choice in the model. In the real world cases you have to say “why did they choose this rule for stopping? What does it say to me about the process? what aspects of the process might be correlated with this stopping rule? Are those things that need to be in my model either through the prior or through the likelihood?”… after you make all those decisions, you can say “this rule is uninformative” or “this rule is informative” but … it didn’t let you short-circuit all that modeling.
      
      In the case “things seemed pretty consistent so we stopped”, depending on my knowledge about what is going on, I may say “that actually lets me know that X might be true and X is correlated with the thing of interest”; the likelihood or prior then gets altered to account for X and its effect on the experimenter “deciding that things seemed pretty consistent”.
      
      Even the bayesian test after each data point using model M1. This is deterministic given the data within M1, but it could be informative for model M2. If model M1 is that basketball shots are bernoulli random variables with constant p and the test is that the posterior probability p(success < 0.1 | M1) < 0.1, if my actual analysis model for this dataset is that p takes on one value initially and then there’s a change point where it jumps up when the player gets “hot”, the fact that they stopped the experiment is in some sense evidence that maybe the change point occurred already thereby biasing the inference for p under model M1 higher, and that’s obviously correlated with the parameters of interest, namely the change point and the size of the jump.
      
      so, no, I don’t think you get away with saying “it’s uninformative so I avoid bothering to model stuff” because its “uninformativeness” is only with respect to whatever model you eventually decide you need.
    - Daniel Lakeland on June 23, 2017 7:33 PM at 7:33 pm said:
      
      In the hot hand example, obviously conditional on the data M1 always gives the same answer. So it’s deterministic even within M2, the hot hand model, that is, within M2 we know that after seeing data D M1 will tell you to stop. But…. the *fact that M1 told you to stop after N* is itself *data* within M2 which is relevant to your inference about the position and size of the jump… this may have to enter into M2 via a choice of prior for the change point which is dependent on N for example, an idea we have both agreed is useful somewhere above.
    - Daniel Lakeland on June 23, 2017 8:07 PM at 8:07 pm said:
      
      One thing I don’t think we’ve mentioned is that you might look at the design of an experiment, see that there is a particular design, be it a stopping rule, or an N chosen by reference to a preliminary study, or an N chosen by decision by committee based on costs and benefits… or whatever
      
      Seeing that there is this rule tells you something about what the experimenters were thinking or how the experiment might be different if some unknown took on different values, and then you *choose to add a parameter to your model*.
      
      This additional parameter, by itself, obviously alters the likelihood as now it’s a function over one (or several) extra parameters.
      
      In the preliminary census study example, you might want to model the bias that the preliminary study had. So you put in a parameter for this bias.
      
      Again, it seems like a mistake to ignore the modeling based on a “this rule is uninformative” type scenario, in part, you might need to add a parameter to *make* it informative.
    - Carlos Ungil on June 24, 2017 6:31 AM at 6:31 am said:
      
      I will ignore my stopping rule for a second to give a last comment, because it seems that you still don’t get it and that would be a pity after such a long discussion.
      
      > In the hot hand example, obviously conditional on the data M1 always gives the same answer. So it’s deterministic even within M2, the hot hand model, that is, within M2 we know that after seeing data D M1 will tell you to stop.
      
      We agree, the stopping rule is a deterministic function of the data (the sequence of success/failure events [ x1 x2 x3 … ]).
      
      > But…. the *fact that M1 told you to stop after N* is itself *data* within M2 which is relevant to your inference about the position and size of the jump…
      
      The fact that we stopped at N (using a stopping rule which depends only on the data and obviously does not depend on the parameters of M2 given the data) is completely irrelevant for inference about the position and size of the jump.
      
      And I wouldn’t say that it takes a great amount of thinking to realise that the stopping rule is non-informative in this case.
      
      > this may have to enter into M2 via a choice of prior for the change point which is dependent on N for example, an idea we have both agreed is useful somewhere above.
      
      If you include the information about N in your prior, your likelihood will have to compensate for that somehow.
      
      If in the end you get a posterior using your modified prior and your full analysis of the generating process that is not exactly the same as the posterior that you would get analysing the sequence with the “N fixed” assumption and the original prior YOU ARE DOING IT WRONG.
      
      I don’t really see the advantage in working more to arrive at a wrong conclusion (or in the best case the same conclusion).
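
The claim that a stopping rule depending only on the data leaves the posterior unchanged can be checked by brute force. The sketch below is a minimal illustration under stated assumptions: the rule (`stops_at`, "stop at the third success or after 8 trials") is hypothetical and much simpler than the posterior-probability test discussed in the thread, the prior is flat on a grid, and all function names are made up for the example. The probability of stopping at N with k successes is a path count times p^k (1-p)^(N-k); the path count does not involve p, so it cancels when the posterior is normalized.

```python
from itertools import product
from math import comb

def stops_at(seq):
    """Hypothetical deterministic rule: stop at the 3rd success or after 8 trials."""
    return sum(seq) >= 3 or len(seq) >= 8

def stopped_paths(max_n=8):
    """Enumerate every complete 0/1 sequence that the rule can produce."""
    paths = []
    for n in range(1, max_n + 1):
        for seq in product((0, 1), repeat=n):
            # The rule must fire at length n and at no earlier prefix.
            if stops_at(seq) and not any(stops_at(seq[:m]) for m in range(1, n)):
                paths.append(seq)
    return paths

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

# Grid over the success probability p, with a flat prior.
grid = [(i + 0.5) / 200 for i in range(200)]

def posterior_with_rule(n, k, paths):
    """Posterior for p using the full likelihood of stopping at n with k successes."""
    count = sum(1 for s in paths if len(s) == n and sum(s) == k)
    return normalize([count * p**k * (1 - p) ** (n - k) for p in grid])

def posterior_fixed_n(n, k):
    """Posterior for p treating n as fixed in advance (binomial likelihood)."""
    return normalize([comb(n, k) * p**k * (1 - p) ** (n - k) for p in grid])

paths = stopped_paths()
with_rule = posterior_with_rule(5, 3, paths)
fixed = posterior_fixed_n(5, 3)
max_gap = max(abs(a - b) for a, b in zip(with_rule, fixed))
print(f"largest difference between the two posteriors: {max_gap:.2e}")
```

The two posteriors agree up to floating-point rounding. The sketch does not address the other case raised in the thread, where the rule itself is unknown to the analyst.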
    - Daniel Lakeland on June 24, 2017 10:33 AM at 10:33 am said:
      
      Carlos:
      
      You’re right, there’s something confusing me and I didn’t know what it was. So I thought about it in the shower this morning, which is always a good idea. And here’s what I came up with (and again, thanks for pushing me because otherwise I wouldn’t have figured this out).
      
      In all of this, there is a tendency on my part to try to map the “real” concern which is messy, to a simple description so you and I can discuss the simple description, but that’s imperfect and in this case, I didn’t even realize what my more real concern was. In this case, it seems to me that the real issue is that the real world cases involve me not knowing what the stopping model was: *the stopping model is unknown*.
      
      The real concerns are things like the biologist saying “it seemed like we were getting the right stuff so we stopped” or “some lawyers negotiated that they’d be willing to pay for 20 samples” or “the census bureau used the SuperEnsemble method (details totally hidden)” or “the preliminary study said X” or “we were getting basically the same things that other labs were getting” or whatever. Again, real world stopping rules are often messy.
      
      Even if you tell me that N is a completely deterministic function of the data once the model was specified… I might want to know *what the heck was your model* and I might want to know this for various reasons:
      
      1) it’s hard to elicit numerical priors from people like biologists or lawyers or engineers (populations I work with ;-)). So, when they didn’t actually use a numerical calculated model (like some Stan code) but rather something that we can think of as similar to a model (say a vast wealth of personal experience in this area) the fact that N is a pseudo-deterministic function of the data given the model may be irrelevant for our likelihood in model M2 that we make, but it still means that we can infer something about the model inside the head of the experimenter. Again mapping things imprecisely, we could infer something about what kind of prior the experimenter has and therefore what we should really be using in our formal M2 programmed into Stan.
      
      2) Some of the projects I’ve worked on are legal settlement negotiations. Here, you’d like to know what the opposing party might be willing to settle for. The fact that they chose a particular N could give you information, such as “they’re really cheap, they don’t want to spend any money” or “they have deep pockets they’re willing to do plenty of investigation”. This might inform a prior on a parameter involving your estimation of what they will settle for.
      
      3) When the stopping rule is unknown (The SuperEnsemble method) the stopping rule could be a deterministic function of the data, or it could be a random function of the data, you just don’t know. But you might have some sketchy details about the kinds of things that go into the stopping rule. That could then allow you to infer some stuff about the stopping rule. So for example you could do Bayesian model selection in your model, one where N is essentially deterministic, and one where N is a random function and dependent on some aspect of additional data/information related to your parameters.
      
      Deep inside all of this has been a desire to help me figure out how to handle these sketchy cases, that are nevertheless pretty common when you’re working with small N variable outcomes, negotiations between parties, or experimenters with tons of background experience but no statistical modeling knowledge. Inventing parameters that describe the not totally known stopping rule and using them to help you figure out stuff in your own model is more like the real concern.
      
      So, summary:
      
      1) Data generating processes in the presence of stopping rules are different from when N is fixed because they handle the possibility that there might be various sized datasets etc… Nevertheless once you plug in N you will get the same likelihood in certain very common cases (called uninformative stopping rules).
      
      2) If the stopping rule was a known deterministic function of the data N, or a nondeterministic function of the data N but not related to your parameters, then it is called uninformative and you can safely ignore it in your analysis.
      
      3) even if the stopping rule, to a person who knows its details, is a deterministic function of the data N, if you don’t know what the stopping rule is precisely, to you it’s a random function and then if you want to know something about it, you can infer information *about it* from the N, what you infer may then inform your own model, and this is particularly useful in cases where the stopping rule is the kind of vague stuff like “it seemed ok” or “we negotiated for 20 samples” or “we used the SuperEnsemble method that our partners developed”. Note that in this case, you’re typically inventing parameters which describe the stopping rule, and so *the stopping rule is a random function related to your parameters*.
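
The three summary points can be written as one factorization. This is a sketch under assumptions not stated in the thread: the observations are independent given θ, the symbol φ stands for whatever invented parameters describe the stopping rule, and s_n(x_{1:n}; φ) denotes the probability of stopping after n observations given the data so far (equal to 0 or 1 for a deterministic rule).

```latex
p(x_{1:N}, N \mid \theta, \phi)
  = \Bigl[\prod_{i=1}^{N} p(x_i \mid \theta)\Bigr]
    \Bigl[\prod_{n=1}^{N-1} \bigl(1 - s_n(x_{1:n}; \phi)\bigr)\Bigr]
    \, s_N(x_{1:N}; \phi)

% Known rule, or \phi independent of \theta a priori: the stopping factor is
% constant in \theta and cancels, so
p(\theta \mid x_{1:N}, N) \propto p(\theta) \prod_{i=1}^{N} p(x_i \mid \theta)

% Unknown rule with a joint prior p(\theta, \phi) that couples the two:
p(\theta, \phi \mid x_{1:N}, N) \propto p(\theta, \phi)
    \Bigl[\prod_{i=1}^{N} p(x_i \mid \theta)\Bigr]
    \Bigl[\prod_{n=1}^{N-1} \bigl(1 - s_n(x_{1:n}; \phi)\bigr)\Bigr]
    \, s_N(x_{1:N}; \phi)
```

In the last line the stopping factor carries information about φ, and through the coupled prior that information reaches θ, which is the situation described in point 3.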
      
      Hey! I think I do finally get it.
    - Daniel Lakeland on June 24, 2017 10:50 AM at 10:50 am said:
      
      Really simple examples: stopping rule for biologist is “calculate the sample average and see if it looks ok” I map this to “calculate the sample average and see if it’s in the high probability region of my biologist’s prior for this parameter based on a vast literature review and experience in the lab” but biologist is incapable of describing prior, they have no idea what it means…
      
      so if there is some natural variability in the process and they didn’t stop earlier, then the running average wasn’t yet into their high probability region of their prior… so we can infer that their high probability region of their prior might not include the region of the running average for the first several samples but it does include the region for the last several samples.
      
      or similarly, maybe they have some knowledge of the measurement instrument, and they stop when the running average variability from the previous sample to the final sample is less than what they know the measurement bias might be… so the average has converged to about as good as it’s going to get given their knowledge of the possible sizes of the bias…. so I can infer what they think about the measurement bias.
      
      Or similarly, the cost tradeoff between the value of improving the estimate of the population average and the cost of running more experiments became in favor of stopping… so I can infer some information about how much they are willing to spend on collecting data given how much they think they might get out of having the data… (this is very relevant to lawsuits)
      
      Hey, that’s great, that’s exactly it!
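
The first example (inferring the biologist's high-probability region from when they stopped) can be sketched as a consistency check. Everything here is an assumption for illustration: the rule is taken to be "stop as soon as the running average falls inside an interval (low, high)", the measurements and candidate intervals are invented, and `consistent_regions` is a hypothetical helper. An interval is compatible with the observed N only if it contains the final running average and none of the earlier ones.

```python
def running_means(values):
    """Running average after each observation."""
    total, means = 0.0, []
    for i, v in enumerate(values, start=1):
        total += v
        means.append(total / i)
    return means

def consistent_regions(values, candidate_regions):
    """Keep the (low, high) regions that would have stopped exactly at the last observation."""
    *earlier, last = running_means(values)
    return [
        (low, high)
        for low, high in candidate_regions
        if low <= last <= high and not any(low <= m <= high for m in earlier)
    ]

# Hypothetical measurements; the experimenter stopped after the fifth one.
observed = [12.0, 11.0, 10.4, 9.8, 9.3]

# Hypothetical guesses for the experimenter's "looks ok" region.
candidates = [(9.0, 10.6), (9.0, 11.0), (8.0, 10.0), (10.2, 10.7)]

plausible = consistent_regions(observed, candidates)
print(plausible)
```

A fuller treatment would put a prior over the interval endpoints and allow noisy stopping decisions; the hard filter above only shows which guesses the observed N rules out.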
  - Daniel Lakeland on June 23, 2017 2:49 PM at 2:49 pm said:
    
    Here is an actual example from a project I actually worked on (how practical can you get?)
    
    Several people survey all the rooms in a fire-damaged building rating the degree of damage in each one on a 0-5 scale with anchored descriptions of the meaning of the scale.
    
    After this, based on cost considerations we have two options:
    
    1) collect N data points, with N determined by cost to investigate vs cost to repair, selected randomly from among all rooms using a computer RNG.
    
    2) subset the rooms into those rated 0,1,2,3,4,5 and select N_i from each with the total number determined by cost to repair vs cost to investigate, and the sub-samples chosen by computer RNG from among those rated in each state.
    
    Now, client likes 2 better, and wants more samples in the 4,5 range because they want to nail down costs and these are the more variable and expensive conditions.
    
    Now, first off, is the choice of N random or deterministic? The truth is, the choice of N comes from some people sitting around a spreadsheet and saying “what if cost of repair of a 5 is C5 and cost to repair of 4 is C4, but cost to survey is ….” and then they negotiate together with their client, and everyone eventually comes down to “survey 20 total rooms over 2 days with a team of X people and we’re authorizing $Y for the survey”
    
    All I can say about that is that it is in fact based on some kind of cost model which is a kind of random variable that describes the interaction of all the different negotiations etc. And, by the way, it’s associated with cost to repair, and cost to repair is an important unknown parameter in the model.
    
    In option 2, not only do we have the total N somehow nebulously chosen and associated with cost to repair, but we also have how many to sample in each subset associated with the ratings. Several people did the ratings, and later on in the model the biases of each individual will become parameters we are going to care about, so that we’ll adjust all the ratings to some abstract average rating across the several surveyors. So the number of rooms chosen to be investigated from within each rating category is probabilistically dependent on the biases of the individual raters.
    
    Of course, when this is written up it will be described like in the first case: “we chose 20 rooms by computer random number generator” and in the second case we could say “we chose 20 rooms by computer random number generator with 4 chosen from among those rated 0-3 and 8 chosen for rating 4 and 8 chosen for rating 5”
    
    Person Q the counter-party who reads that description will say “there’s no stopping rule, N=20 is fixed”
    
    but given the kind of information I’ve given you above, I certainly think it’s reasonable to call the choice of N random and dependent on the cost parameter.
    
    So, your mileage may vary… a lot.
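
The two sampling options described above can be sketched with Python's standard `random` module. The building data, room names, seed, and function names are all hypothetical; only the quotas (4 rooms from ratings 0-3, 8 from rating 4, 8 from rating 5) come from the comment. The sketch covers the mechanical RNG step only, not the negotiation that produced N = 20.

```python
import random

def simple_sample(ratings, n, seed=0):
    """Option 1: draw n rooms uniformly at random from the whole building."""
    rng = random.Random(seed)
    return sorted(rng.sample(sorted(ratings), n))

def stratified_sample(ratings, quotas, seed=0):
    """Option 2: draw a fixed number of rooms from each rating stratum.

    ratings: dict mapping room id -> damage rating (0-5)
    quotas:  list of (set of ratings, number of rooms to draw)
    """
    rng = random.Random(seed)
    chosen = []
    for levels, count in quotas:
        pool = sorted(room for room, r in ratings.items() if r in levels)
        chosen.extend(rng.sample(pool, count))
    return sorted(chosen)

# Hypothetical building: 120 rooms, 20 rooms at each rating level.
ratings = {f"room{i:03d}": i % 6 for i in range(120)}

# Quotas as described in the write-up: 4 from 0-3, 8 from 4, 8 from 5.
quotas = [({0, 1, 2, 3}, 4), ({4}, 8), ({5}, 8)]

option_1 = simple_sample(ratings, 20)
option_2 = stratified_sample(ratings, quotas)
print(len(option_1), len(option_2))
```

Both outputs read as "20 rooms chosen by computer RNG" in a write-up, which is why the dependence of the quotas on raters and costs is invisible to a reader who only sees the description.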
    
    - Carlos Ungil on June 23, 2017 2:59 PM at 2:59 pm said:
      
      Assuming the choice of N doesn’t just come from people sitting around a spreadsheet but it’s based on some information relative to the damage in the building, did you include your information in your prior? If not, why not?
- Daniel Lakeland on June 23, 2017 2:16 PM at 2:16 pm said:
  
  @Corey: if you’re still following along here. maybe you have something to add more on the metaphysics. It seems to me like the preliminary census study to determine N = 3000 case is an archetype of a vast ocean of practical problems. People do prelim studies, they do some calculations, then they go out and do the big study, and the N they use for the big study looks like a fixed factor, but it might as well be considered the random output of f(PrelimData, DecisionParameters, CostConsiderations, …) some of which may eventually be highly correlated with parameters you eventually decide need to be in the analysis of the full dataset. (in census example you could think of say illegal immigrants who go out of their way to avoid the census which then makes the census preliminary study biased and this bias is something you’re trying to analyze, that seems not even a little farfetched)
  
  In this case, even what looks like a fixed rule “sample until you get 3000 measurements” is in some sense a random rule, correlated with stuff you care about, because whether you choose 3000 or 3300 or 2700 or whatever is dependent on what information went into that preliminary analysis.
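
To make the idea of N as the output of f(PrelimData, …) concrete, here is a minimal sketch that uses a textbook margin-of-error calculation as a stand-in for f. This is an assumption: the thread does not say how any actual N was chosen, and `planned_n`, the preliminary values, and the margin are invented for illustration.

```python
from math import ceil

def planned_n(prelim_values, margin, z=1.96):
    """Hypothetical planning rule: smallest n whose z-interval half-width is <= margin.

    Uses the sample variance of the preliminary data as the variance estimate.
    """
    m = len(prelim_values)
    mean = sum(prelim_values) / m
    var = sum((v - mean) ** 2 for v in prelim_values) / (m - 1)
    return ceil(z * z * var / margin**2)

# Two hypothetical preliminary studies of the same quantity.
prelim_a = [10.0, 12.0, 9.0, 11.0, 13.0, 8.0]
prelim_b = [10.0, 14.0, 7.0, 11.0, 15.0, 6.0]  # noisier preliminary data

for name, prelim in (("A", prelim_a), ("B", prelim_b)):
    print(f"preliminary study {name}: planned N = {planned_n(prelim, margin=0.1)}")
```

The "fixed" N reported for the main study differs depending on which preliminary data happened to be observed, so anything that biased the preliminary study also shaped N.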
  
  - Corey on June 23, 2017 4:09 PM at 4:09 pm said:
    
    Like AG said, the math is in BDA. IIRC (and that’s a medium-sized ‘if’), as long as you condition on the preliminary study data as well as the main study data (and why wouldn’t you?) you’re fine. That immediately implies that if N is large enough that the main study would swamp the information in the preliminary study (in the same sense that we speak of the likelihood swamping the prior) then how N was determined ceases to be of practical importance.
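
The swamping argument can be illustrated with a conjugate Beta-Bernoulli sketch. The numbers are hypothetical (a preliminary study with 30 successes in 100, a main study with a 40% success fraction, a Beta(1, 1) prior), and the model is an assumption chosen for simplicity rather than anything taken from BDA. The sketch compares the posterior mean with and without the preliminary data as the main-study N grows.

```python
def beta_posterior_mean(successes, failures, a=1.0, b=1.0):
    """Posterior mean of a Bernoulli rate under a Beta(a, b) prior."""
    return (a + successes) / (a + b + successes + failures)

PRELIM_SUCCESSES, PRELIM_FAILURES = 30, 70  # hypothetical preliminary study
MAIN_RATE = 0.4                             # hypothetical main-study success fraction

def gap_from_prelim(n_main):
    """Shift in the posterior mean caused by also conditioning on the preliminary data."""
    s = round(MAIN_RATE * n_main)
    f = n_main - s
    main_only = beta_posterior_mean(s, f)
    both = beta_posterior_mean(s + PRELIM_SUCCESSES, f + PRELIM_FAILURES)
    return abs(main_only - both)

gaps = {n: gap_from_prelim(n) for n in (10, 100, 1000, 100000)}
for n, gap in gaps.items():
    print(f"N = {n:>6}: shift in posterior mean from the preliminary data = {gap:.4f}")
```

The shift shrinks as N grows, which is the sense in which the preliminary study, and therefore how N was derived from it, stops mattering in practice for large N.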
    
    - Daniel Lakeland on June 23, 2017 7:54 PM at 7:54 pm said:
      
      Sure, this part about the data swamping the prior is fine for large N, and conditioning on the preliminary study is basically exactly the solution Carlos and I have come up with above, and putting this conditioning into the prior will often make good sense. That’s especially true if your knowledge of the preliminary study is minimal (ie. I know they did a prelim study, but I don’t know how big, nor what the criteria were in the calculation, but I have a guess that it’s probably n on the order of 100 or so rather than thousands…)
      
      The bigger issue comes in when N is small, and that’s really common in stuff I work on. Biologists want to look at 3 or 5 or 25 animals, not 3500, and they first look at 3 or 4 of them to “try out their technique” without really recording those because they weren’t “really part of the study since we hadn’t really determined what surgery/drug/measurement technique we were really going to use”… and so those first 3 inform them that “stuff seems to be working” and then they collect 5 more with “a consistent technique” and then they want to know how “doing x” affects “outcome y” and then when you tell them that the posterior distribution isn’t all that concentrated, they say “well the original student graduated, but if I have my new grad student do 5 more will that be enough?”
      
      And I guess my intuition is just “model all of it” and be ready to put some portion of the model, whether you classify it as likelihood or prior, that is potentially N dependent, or to add parameters that alter your likelihood function which express aspects of what caused you to choose the N or the experimental technique, or whatever.
    - Corey on June 23, 2017 11:42 PM at 11:42 pm said:
      
      Ah, biologists.
    - Anoneuoid on June 24, 2017 12:49 PM at 12:49 pm said:
      
      The best part is all the cell/molecular bio jargon he spews is based on the same type of studies being made fun of. Not sure if that was intentional.
    - Daniel Lakeland on June 24, 2017 8:17 PM at 8:17 pm said:
      
      I’m sure yes.
    - Corey on June 25, 2017 3:58 PM at 3:58 pm said:
      
      The best part for me is that while that’s clearly intended to sound like bafflegab to statisticians, my background in biochemistry allows me to understand it and know that it actually makes reasonable sense in context.
    - Glen M. Sizemore on June 24, 2017 11:51 AM at 11:51 am said:
      
      “The bigger issue comes in when N is small, and that’s really common in stuff I work on. Biologists want to look at 3 or 5 or 25 animals, not 3500…”
      
      GS: Just out of curiosity, are these biologists asking questions that are relevant to groups or to individuals?
    - Daniel Lakeland on June 24, 2017 8:16 PM at 8:16 pm said:
      
      Glen, it depends a lot on the study, but often it’s really groups. I still encourage people to do time-series models in which they observe the way in which individuals respond through time (say multiple measurements of healing through time for example), but lots of the experiments require you to say harvest embryos at a certain point in development, so you can’t ever get more than one observation from an individual.

Statistical Modeling, Causal Inference, and Social Science


125 thoughts on “Don’t say “improper prior.” Say “non-generative model.””
