I grant you that it is quite esoteric. I will try to unpack some of its peculiarities.

The exps in the function “pYes” are there so that par[1] and par[2] are always positive. Conversely, in the vector priorMeans the means are log’d so that they would be more easily understood, but maybe they aren’t. The core equation itself, (x/alpha)^beta, is widely used in psychophysics to relate physical signal level to the internal signal-to-noise ratio (cf. e.g. Kontsevich and Tyler from the previous post). The intercept in the model corresponds to a “decision criterion”, if you think of it as a latent variable model (https://en.wikipedia.org/wiki/Logistic_regression#As_a_latent-variable_model).

The logs in the function informationGain come from the entropy of a Bernoulli distribution. As I said, the heuristic is to minimize the entropy of the posterior distribution. Here, instead, it is the entropy of the probability distribution of the responses that is minimized; the algorithm (during the optimisation step inside the main loop) chooses the stimulus that reduces the entropy of the Bernoulli distribution the most. Kujala and Lukka have more information about this in their article.
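To make the entropy bookkeeping a bit more concrete, here is a minimal vectorized sketch of the same quantity. The names bernoulliEntropy, pYesAll and expectedGain are my own for illustration, and the particle set is just a random stand-in. Note the sign: this returns the gain itself, whereas informationGain in the appendix returns its negative so that optimise can minimize it.

```r
# Entropy (in nats) of a Bernoulli distribution with success probability p.
bernoulliEntropy <- function(p) {
  -(p * log(p) + (1 - p) * log(1 - p))
}

# Same response model as pYes in the appendix, vectorized over particle rows.
pYesAll <- function(x, particles) {
  0.98 * pnorm(-particles[, 3] + (x / exp(particles[, 1]))^exp(particles[, 2])) + 0.02 * 0.5
}

# Expected information gain of presenting stimulus x:
# entropy of the weight-averaged prediction minus the weight-averaged entropy.
expectedGain <- function(x, particles, weights) {
  p <- pYesAll(x, particles)
  bernoulliEntropy(sum(weights * p)) - sum(weights * bernoulliEntropy(p))
}

# Illustrative particle set drawn from the same priors as in the appendix.
set.seed(1)
particles <- cbind(rnorm(500, log(2), 1), rnorm(500, log(1), 1), rnorm(500, 1.2, 1))
weights <- rep(1 / 500, 500)

# Gain at a few arbitrary stimulus levels; larger means more informative.
gains <- sapply(c(0.5, 2, 8), expectedGain, particles = particles, weights = weights)
print(gains)
```

Because entropy is concave, the gain cannot be negative, which is a handy sanity check when fiddling with the code.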

The arbitrary constants in pYes are indeed arbitrary. The constants 0.98 and, conversely, 0.02 are mixing coefficients; they denote how much of the response is dictated by the core equation and how much of it is due to unbiased noise (the coefficient 0.5). This principle is elaborated on in the Zeigenfuse reference. Also I think Kruschke wrote about this sort of mixture modeling in his book, calling it “robust regression”, but I can’t put my finger on it.
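A small sketch of what the mixing does in practice, assuming the same pYes as in the appendix (the parameter vectors below are arbitrary illustrations): however extreme the core probit term gets, the response probability stays between 0.01 and 0.99, so a single lapse can never zero out a particle's weight.

```r
# Mixture response model: with probability 0.98 the answer follows the core
# probit equation, with probability 0.02 it is an unbiased coin flip (0.5).
pYes <- function(x, par) {
  core <- pnorm(-par[3] + (x / exp(par[1]))^exp(par[2]))
  0.98 * core + 0.02 * 0.5
}

par <- c(log(2), log(1), 1.2)          # same values as priorMeans in the appendix
upper <- pYes(1e6, par)                # core term saturates at 1
lower <- pYes(0, c(log(2), log(1), 50)) # huge criterion pushes the core term to 0
print(c(lower = lower, upper = upper))
```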

My ever-present position is to test a model you have derived from a theory/explanation/whatever and then work from there.

In the end, hypotheses about which RNG generated your data are stupid things to test. What we want are mechanistic predictive models with bounds on the imprecision. That’s what Bayesian analysis gives you.

1. HA: ‘treatment’ is better than baseline

2. HA: ‘treatment’ is worse than baseline

[…]

So yes, the null hypothesis of no effect is often a priori false, however, the one sided nulls are not false.

Not at all. That is not the null hypothesis, alternative hypothesis, or any hypothesis being tested. Your null hypothesis is whatever model you actually calculated a p-value (or whatever) based on. This will include other assumptions besides mean1 = mean2, such as normality, iid data, etc. **There is no reason to privilege the mean1 = mean2 assumption.**

In this sequential sampling case, the iid assumption that a lot of these default statistical models make is violated. I think people don’t even understand the first thing about what they are testing, which leads to all these problems. I know a lot of readers probably think I am hyperbolic but it really is idiotic if you understand what is going on:

That’s all fine, except (a) effects can be highly variable, hence an effect size of +0.002 in a particular experiment, even if several standard errors from zero, doesn’t tell us much about what might happen next time (I’m assuming a scale in which effect sizes on the order of 0.1 are interesting); and (b) all this type 1 error rate control stuff is not really relevant to questions of distinguishing positive from negative effects.

1. HA: ‘treatment’ is better than baseline

2. HA: ‘treatment’ is worse than baseline

A point null hypothesis test tests both of these while controlling the family-wise error rate. This corresponds to how people actually interpret the results. People don’t say: “The treatment was shown to have a non-zero effect on the condition,” rather they interpret it directionally: “the treatment was shown to improve the condition.”

So yes, the null hypothesis of no effect is often a priori false, however, the one sided nulls are not false. So here is a counterintuitive response to your statement that “the null hypothesis is false.”:

The null hypothesis is true. The question the test is addressing is which one.

If I had the time to waste I could redo this with a rejected study with frequency-based analyses where the reviewers stated that the study was too noisy (underpowered) and had not adjusted properly for multiple analyses. The resubmission would do a Bayesian analysis with a flat prior and prattle on about the advantages of now knowing the posterior probabilities, highlighting credible intervals that are almost identical to the previous confidence intervals.

I think the thing with the p-values is irrelevant to good practice in that we should not be using p-values to make inferences or decisions. I disagree entirely with your “false discovery rate” attitude in that I do not think the purpose of a study is, or should be, the “discovery” of nonzero differences. All differences are nonzero. Just get N=10^6 and you can get as many discoveries as you want.

Regarding the point estimates: yes, any selection on statistical significance will bias your point estimates. This arises with sequential or non-sequential designs. However, if you perform a sequential design and report all your data, there should not be a problem.

In addition, I disagree completely with your conclusion that a researcher should be “increasing your sample size in small bits until you meet some threshold.” It’s always better to get more data. The reason for not getting more data is some combination of cost, convenience, and urgency, not a statistical significance threshold. Again, the null hypothesis of exactly zero effect and zero systematic error will never be true, so I have no interest in rejecting it 5% of the time or whatever. This is a game that I have no interest in playing, and which I don’t think researchers should be playing. And, for that matter, I don’t think Alan Turing used statistical significance thresholds when cracking codes (or, at least, I haven’t heard of him doing so).

No, that is not correct. See my above post.

Or the funniest:

Bill Drissel

Frisco, TX

I’m happy to be basically guaranteed to reject the null hypothesis. The null hypothesis is false.

Andrew’s argument, if I understand correctly, is: “Whatever, NHST (frequentist and Bayesian) is useless and broken so who cares if you do the sequential analysis wrong.”

I think you can make the argument that NHST has serious problems, as Andrew often does. Whatever your bottom line decision rule should be, which has always been less clear to me in Andrew’s writing, you’ve got to correctly account for your sequential design. Sometimes you get this for free by virtue of being in a Bayesian paradigm and sometimes you don’t.

I wrote a blog post outlining the consequences of sticking to NHST and not adjusting for sequential data collection. I hope it can serve as an eye-opener for some, as it clearly shows how large the bias in the p-values *and* the effect size estimates is when applying this approach:

http://blog.casperalbers.nl/science/statistics/the-problem-of-unadjusted-sequential-analyses/

I have a small disagreement with this statement.

(A) It IS useful to learn about an effect size being small

(B) The usefulness of (A) is predicated on having a large enough sample. And one way that will occur is if your ‘true’ effect size is very small and you have a statistical significance based stopping rule.

So while I agree that p-value based stopping rules are not a generally coherent framework, a side effect of implementing them is that a ‘precisely estimated zero’ obtained from doing so is quite useful. Think of this as the inverse to the type-M problem.

The motivation for this is, or at least used to be, rather practical: if we are interested in, e.g., the faintest stimulus the subject can detect, it doesn’t really make sense to present them with stimuli they are always able to notice. This resulted in different sorts of “non-parametric” sequential tests, in which some simple rule would be used to determine the next stimulus. Later, as was said in the beginning of this post, more mathematical methods for stimulus selection were developed, since in the more complex models the stimulus placement depends on more things than just the psychophysical threshold.

To make everyone more bored, I’ve attached some quickly put-together R code for a simple adaptive psychophysical task. I scripted it while on a tea break, so it lacks a certain robustness in the programming sense… but still, I thought that maybe people could find it fun to play around with. It uses sequential importance sampling at this point, so particle degeneracy can become a problem if one wants to run longer simulations. In those cases I’d recommend adding a “resample-move” step, as in Chopin (2002).
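For anyone wanting to try that, here is a rough sketch of the resampling half only, under my own assumptions: effectiveSampleSize and resampleParticles are hypothetical helper names, the 50% threshold is an arbitrary choice, and the “move” (MCMC rejuvenation) step from Chopin (2002) is not implemented. Without the move step, resampling just duplicates particles, so treat this as a starting point.

```r
# Effective sample size of normalized importance weights.
effectiveSampleSize <- function(weights) {
  1 / sum(weights^2)
}

# Multinomial resampling: draw particle rows in proportion to their weights
# and reset the weights to uniform. The "move" step of resample-move is
# NOT implemented here.
resampleParticles <- function(particles, weights) {
  n <- nrow(particles)
  idx <- sample.int(n, size = n, replace = TRUE, prob = weights)
  list(particles = particles[idx, , drop = FALSE], weights = rep(1 / n, n))
}

# Illustrative degenerate particle set (stand-in for the one in the appendix).
set.seed(1)
particles <- matrix(rnorm(3000), ncol = 3)
weights <- rexp(1000)^4            # deliberately very uneven weights
weights <- weights / sum(weights)

essBefore <- effectiveSampleSize(weights)
if (essBefore < 0.5 * nrow(particles)) {   # arbitrary degeneracy threshold
  res <- resampleParticles(particles, weights)
  particles <- res$particles
  weights <- res$weights
}
essAfter <- effectiveSampleSize(weights)
print(c(before = essBefore, after = essAfter))
```

In the simulation loop this check would go right after the weights are renormalized on each trial.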

Also, since it is all in native R, and I was too lazy to figure out some vectorizations, it is also really slow, so beware. The model itself is quite simple. There’s an observer making binary decisions, basing their decision on the “internal” strength of the signal (which depends on where the signal-to-noise ratio is 1 and on the non-linearity of the internal scale) and a decisional bound, pretty much like in basic probit models. The probability is “padded” a bit by mixing in some non-cognitive factors (like in Zeigenfuse and Lee 2010, if I recall correctly). So there it is.

References:

Chopin, N. (2002). A sequential particle filter for static models. Biometrika.

Dimattina, C. (2015). Fast Adaptive Estimation of Multidimensional Psychometric Functions. Journal of Vision.

Kontsevich, L.L. and Tyler, C.W. (1999). Bayesian Adaptive Estimation of Psychometric Slope and Threshold. Vision Research.

Kujala, J.V. and Lukka, T.J. (2006). Bayesian Adaptive Estimation: The Next Dimension. Journal of Mathematical Psychology.

Shen, Y. and Richards, V.M. (2013). Bayesian Adaptive Estimation of the Auditory Filter. Journal of the Acoustical Society of America.

Zeigenfuse, M.D. and Lee, M.D. (2010). A General Latent Assignment Approach for Modeling Psychological Contaminants. Journal of Mathematical Psychology.

APPENDIX (CODE CODE CODE AAH)

# Some Functions
pYes = function(x, par) {
  0.98 * pnorm(-par[3] + (x / exp(par[1])) ^ exp(par[2])) + 0.02 * 0.5
}

informationGain = function(stimulus, particles, weights) {
  pyes = rep(0.5, length(weights))
  sum1 = 0
  sum2 = 0
  for(i in 1:length(weights)){
    pyes[i] = pYes(stimulus, particles[i,])
    sum1 = sum1 + pyes[i] * weights[i]
    sum2 = sum2 + (-(pyes[i] * log(pyes[i]) + (1 - pyes[i]) * log(1 - pyes[i]))) * weights[i]
  }
  sum1 = (-(sum1 * log(sum1) + (1 - sum1) * log(1 - sum1)))
  return(-(sum1 - sum2))
}

# Particle set
priorMeans = c(log(2), log(1), 1.2)
priorSd = c(1, 1, 1)
nParticles = 1000
particles = matrix(NaN, ncol = 3, nrow = nParticles)
particles[,1] = rnorm(nParticles, priorMeans[1], priorSd[1])
particles[,2] = rnorm(nParticles, priorMeans[2], priorSd[2])
particles[,3] = rnorm(nParticles, priorMeans[3], priorSd[3])
weights = rep(1 / nParticles, nParticles)

# Parameters for the simulation
nTrials = 100
answers = c()
stimuli = c()
generatingValues = c(1, 0.5, 1)

# Run simulation:
for(t in 1:nTrials) {
  # Choose stimulus:
  stimuli[t] = optimise(informationGain, lower = 0, upper = 10, particles = particles, weights = weights)$minimum
  answers[t] = rbinom(1, 1, pYes(stimuli[t], generatingValues))
  # Update prior
  for(i in 1:length(weights)) {
    weights[i] = weights[i] * (answers[t] * pYes(stimuli[t], particles[i,]) + (1 - answers[t]) * (1 - pYes(stimuli[t], particles[i,])))
  }
  weights = weights / sum(weights)
}

This strategy gives you a significant result 98.8%(!) of the times if there actually is no effect.

https://pbs.twimg.com/media/DcL81C-W0AAuPk5.jpg:large

This is wrong, groupA and groupB should be initialized inside the outer loop. As it is now they grow to very large sample sizes. I get ~20% significant results for that scenario.

I had very much the same experience with a statistical colleague a couple months ago. Before and afterwards, I sent them some material on how bad this actually is. Have no idea what the impact was/will be.

Largely, I think it is bad meta-physics or meta-statistics at the root of this, and that is why it is so hard to get folks to take criticism seriously. For instance, to some the likelihood principle means frequency properties are irrelevant, so they will just dismiss looking at frequency properties.

If you can get someone’s attention and time, this simulation based exposition of the issue by Andrew may be a good bet http://andrewgelman.com/2016/08/22/bayesian-inference-completely-solves-the-multiple-comparisons-problem/

I discussed it in a wider context here (where it is Case study 1) http://andrewgelman.com/2016/08/22/bayesian-inference-completely-solves-the-multiple-comparisons-problem/

This certainly would make sense in clinical research, for instance when trying to carefully balance the control and intervention groups. But again, the primary motivation is to evaluate feasibility, safety, compliance, timing, costs, etc.

https://pastebin.com/EcbU8sC7

Here are the results of the above for 100 simulations. It is the distribution of p-values you get by taking either the final or lowest p-value. About 35% were less than 0.05 in that case:

https://image.ibb.co/mZU6KS/seq_sample.png

If you didn’t do sequential sampling then those histograms would look like uniform distributions and ~5% of p-values would be below 0.05.
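For readers who don’t want to open the pastebin, here is a self-contained sketch of the same kind of simulation. It is not the linked script: the starting n, batch size, maximum n, and number of simulations are all my own illustrative assumptions, and “final” here means the p-value at the maximum sample size without any stopping. Note that groupA and groupB are created fresh inside each simulated study.

```r
# Simulate "peek after every batch" under a true null effect.
# All settings (startN, step, maxN, 1000 simulations) are illustrative.
set.seed(1)
simulateStudy <- function(startN = 10, step = 5, maxN = 100) {
  groupA <- rnorm(startN)   # initialized fresh for every simulated study
  groupB <- rnorm(startN)
  pvals <- t.test(groupA, groupB)$p.value
  while (length(groupA) < maxN) {
    groupA <- c(groupA, rnorm(step))
    groupB <- c(groupB, rnorm(step))
    pvals <- c(pvals, t.test(groupA, groupB)$p.value)
  }
  # final: p-value at maxN; lowest: best p-value seen at any peek
  c(final = pvals[length(pvals)], lowest = min(pvals))
}

results <- replicate(1000, simulateStudy())
rateFinal  <- mean(results["final", ] < 0.05)   # fixed-n false positive rate
rateLowest <- mean(results["lowest", ] < 0.05)  # rate if you stop at first p < 0.05
print(c(final = rateFinal, lowest = rateLowest))
```

The “lowest” rate is what a stop-as-soon-as-significant rule would report, and it grows with the number of peeks; the “final” rate is the ordinary fixed-sample test.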

“We continuously increased the number of animals until statistical significance was reached to support our conclusions”

“We assumed the data was iid but then made collection of new data dependent on the outcome of the previous data so it wasn’t iid. Then we rejected the iid model and concluded that we know the cure for cancer (or whatever).”

That is how stupid this is. And do not be mistaken, it is widespread and has been for decades. The main problem at this point is that the human mind recoils at the thought of the consequences.

You do make a great point that had this study used an a-priori fixed sampling procedure instead of a post-hoc sequential one, it would not have been much better. While that is true in this case, in many other cases this does not hold. As such, I do think that it is good to focus (to some degree) on this particular bad approach, without losing sight of other problematic practices.

Just to be clear:

– Bayesian methods don’t require an adjustment for sequential design. They do, however, require a model for the outcome that includes all variables used in the design (in the case of a sequential design, the key variable is time).

– I think decision rules based on null hypothesis significance testing (these include p-values, Bayes factors, and decisions based on whether a 95% confidence or posterior interval includes zero) make no sense and will in general have bad statistical properties.

– I think it’s a mistake to think you’ve “won big” if you get a huge effect size estimate along with a large standard error. I discussed this problem in section 2.1 of the paper linked to above.

– I disagree completely with the claim that Bayesian analyses with vague priors have great frequency properties. I’ve talked and written about this a lot: Bayesian analysis with vague priors leads to the following sort of statement: If you have an estimate that’s 1 se from zero, you end up with 5:1 odds that the true effect is positive. Go around giving 5:1 odds based on pure noise and you’re gonna lose a lot of bets.

– The likelihood principle is what it is. In any case, you can do most of the above reasoning without worrying about the likelihood principle, just looking at frequency properties.

– The practical effect I’m hoping for from this post is for people to focus on important statistical issues. To criticize the above-linked study based on its sequential design is, to me, ridiculous, as it would have almost all the same problems had the sample size been fixed. The sequential design is a minor part of the study, and to pick on that seems to me like a distraction. For influential people (including the editor of a leading psychology journal) to focus on this seems to me to miss the point, and it’s perhaps one clue as to how so many crappy papers get published in top journals: there’s an attitude that if various arbitrary rules are followed (no sequential design, p less than 0.05, etc.), a paper gets to be published. That led to the Bem ESP debacle.
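The 5:1 figure in the vague-prior point above can be checked in a couple of lines, assuming a flat prior and a normal likelihood, so that the posterior for the effect is normal with mean equal to the estimate and sd equal to the standard error (the variable names are just for illustration):

```r
# Flat prior + normal likelihood: posterior for the effect is N(estimate, se^2).
estimate <- 1   # point estimate, here exactly 1 standard error from zero
se <- 1
pPositive <- pnorm(estimate / se)             # posterior Pr(effect > 0)
oddsPositive <- pPositive / (1 - pPositive)   # posterior odds of a positive effect
print(c(probability = pPositive, odds = oddsPositive))
```

This gives a probability of about 0.84, i.e. odds a little over 5:1, from an estimate that would routinely arise from noise alone.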

As long as it's done this way, this approach is a double-whammy: you get the chance to "win big" with a huge effect size early on based on a type S/M error, and later on based on showing an irrelevant effect.

In my experience, when I tried to discuss the potential issues (which admittedly are more to do with the significant-or-not interpretation than with the sequential data collection), I just got told that I am too stupid to understand the likelihood principle (and that I should read some article by Berry and Berry that explains it even for people like me). So these kinds of posts do really worry me in terms of the practical effects they have, although here at least the "this is fine in a sense, if you don't care about the type 1 error rate or other frequentist operating characteristics"-disclaimer is clear. However, I would have wished for it to be even bigger and more clearly spelled out. One can never be too clear.

Yes, I agree, your scenario is different from the sort of pilot study we see in statistics, where the range of the data is pretty much known ahead of time (a simple example being binary data with a roughly known frequency).

The pilot study might cost $800, and collect very basic info about failure types and so forth, but the final study will need to look at N windows, with pressure tests or chipping away stucco to reveal installation techniques, or whatever. Maybe $5000 per window by the time scaffolding is set up, and 4 or 5 simultaneous workers per window.

You definitely don’t want to do some kind of industrial process control textbook formula for sample size and tell the inspection team to completely strip 400 windows out of the building at a total cost exceeding twice the quantity requested in the settlement discussions.

A real-world cost-based decision analysis is a real thing here, and even low-grade biased measurements from a pilot of 8 windows are a damn sight better than any other technique for determining the sample size.

It’s not just that. Given that variance will be so high with small N, there’s not really any point in working to control bias at this stage. That’s one reason that pilot studies are often not randomized. Or, if they are, the point is to check that the randomization is feasible, not to worry about balance in some group of 4 patients or whatever.

Unless you’re doing it explicitly to make a decision about the size of your follow-up study, in which case you should try to get the best data you can so you can feed it all into a decision analysis.

I think normally this kind of decision analysis based followup isn’t done, and that’s why people put less effort into their pilot studies.

I’d need to see the example. In any example I’ve ever seen, the interval based on 3 or 4 pairs is so wide as to include huge swathes of completely unrealistic parameter values.

Based on my experiences, this is not true. I’ve seen lots of pilot studies that are something like matched pairs with 3 or 4 pairs. It was definitely not the case that the researchers’ prior sd for the effect size was considerably less than half of the sd of the difference of a randomly selected pair.

To be clear, I’m not talking about ideal worlds here.

Sure. But another way of putting it is that, if you have any kind of reasonable prior, the likelihood from your pilot study will be so weak as to have virtually no effect on your posterior. And, in addition to that, a pilot study will typically have lots of bias: you’re doing the pilot to make sure the treatment can be implemented as planned, and there’s no real reason to put lots of effort into controlling biases.

Draw N posterior samples; for each i = 1…N, generate a fake dataset of Q data points according to the generating process, then run a Bayesian inference on this fake dataset, and determine some posterior samples for k, an important parameter.

Vary Q

Using a utility function that encodes how much you really care about knowing the best value for k, choose a Q that maximizes your expected utility across the N possible parameter vectors.

Now carry out your “real” study using sample size Q

And *that* is what “Bayesian Power Analysis” should look like, and it *doesn’t* suffer from the “noisy point estimate” problem.
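Here is a toy sketch of that recipe, under loudly stated assumptions of my own: the pilot posterior for k is taken to be Normal(0.3, 0.5^2), the data are normal with known sd 1, the utility is negative squared error minus a hypothetical cost per observation, and a conjugate closed-form update stands in for a full Bayesian fit. The mean of the Q fake data points is simulated directly, which is exact for this model.

```r
set.seed(1)
# Assumed stand-in for the pilot posterior of k: Normal(0.3, 0.5^2).
priorMean <- 0.3
priorSd <- 0.5
dataSd <- 1            # assumed known observation noise
N <- 2000              # number of posterior draws
costPerPoint <- 0.001  # hypothetical cost of one observation, in utility units

expectedUtility <- function(Q) {
  k <- rnorm(N, priorMean, priorSd)        # N plausible "true" values of k
  ybar <- rnorm(N, k, dataSd / sqrt(Q))    # mean of Q fake data points per draw
  # Conjugate normal update: posterior precision and posterior mean for k
  postPrec <- 1 / priorSd^2 + Q / dataSd^2
  postMean <- (priorMean / priorSd^2 + ybar * Q / dataSd^2) / postPrec
  # Utility: negative squared error minus data cost, averaged over the draws
  mean(-(postMean - k)^2 - costPerPoint * Q)
}

candidateQ <- c(5, 25, 125, 625)   # "Vary Q"
utilities <- sapply(candidateQ, expectedUtility)
bestQ <- candidateQ[which.max(utilities)]
print(data.frame(Q = candidateQ, expectedUtility = utilities))
print(bestQ)
```

With a real model you would replace the conjugate update with an actual fit per fake dataset and plug in a utility that reflects what knowing k is worth to you.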

Of course don’t use a flat prior for the design. Use prior information. There’s always prior information, otherwise why are they doing the experiment in the first place?

1. Yes, with some sequential designs you can increase the probability of getting statistical significance. So what? Statistical significance, by itself, tells us nothing.

2. I disagree with your statement that, “in theory, a pilot test might be a good way to generate an estimate for a priori power analysis.” Even in theory the pilot study is a bad way to generate this estimate. It will be too noisy. The lower limit of an 80% interval from a pilot study is not a conservative estimate of anything; it’s just a random number!

Honesty and transparency are not enough.

In the end I really think just *stop using statistical significance* in any way, unless your whole goal is actually to test computer random number generators.

No one is going to start out to do a test saying “I’m going to sample until I get statistical significance” and actually do it if it takes more than a very moderate amount of resources. So the concept of “sampling to significance” is a purely theoretical one for almost all researchers. And if you’re doing research and you decide “our sample doesn’t give us significance, let’s collect a little more data and see if it does” then *you’re just doing it wrong*, NOT because you are sampling wrong, but because you’re ANALYZING wrong.

As far as I’m concerned, a pilot analysis is “for” showing that the experiment is feasible, but after collecting the data, there’s absolutely no reason not to get a Bayesian posterior distribution from the pilot data and use it to form an informed prior for the full analysis.

NHST + power analysis is just again almost always going in the wrong direction. In a Bayesian analysis a “power” analysis is all about the question of “how much data would I have to collect to make a sufficiently low-risk decision” instead of “false positive” or “false negative” dichotomization.
